===== Amber GPU Testing (EC) =====

We are interested in benchmarking the serial, MPI, cuda and cuda.MPI versions of pmemd.

==== Results ====

  * Verified the MPI threads and GPU invocations
  * Verified the output data
  * pmemd.cuda.MPI runs produced errors (see GPU results below)
  * Script used is listed at the end of this page

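A minimal way to spot-check the MPI thread counts and GPU invocations while a job runs, assuming the single K20 workstation used here (the "GPU DEVICE INFO" grep pattern is an assumption about the mdout header wording):

<code>
# count the pmemd.MPI ranks actually running (should match -np)
ps -ef | grep [p]memd.MPI | wc -l

# watch GPU utilization and which processes are attached to each K20
nvidia-smi

# the cuda builds also record the GPU(s) used in the mdout header
grep -A 10 "GPU DEVICE INFO" mdout
</code>
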
^ PMEMD implementation of SANDER, Release 12 ^
|Minimizing the system with 25 kcal/mol restraints on protein, 500 steps of steepest descent and 500 of conjugate gradient - Surjit Dixit problem set|

^CPU Jobs (1,000 steps)^ Serial ^ -np 2 ^ -np 4 ^ -np 8 ^ -np 16 ^ -np 24 ^ -np 32 ^
|Wall Time (secs)|  211  |  120  |  64  |  35  |  29  |  26  |  33  |

  * MPI speedup near -np 24 is about 8x serial (211 s / 26 s)

^GPU Jobs^ Serial ^ -np 2 ^ -np 4 ^ -np 8 ^ -np 16 ^ -np 24 ^ -np 32 ^
|Wall Time (secs)|  12  |    |    |    |    |    |    |

  * GPU serial speedup is 17.5x CPU serial performance and outperforms the best MPI result (-np 24) by more than 2x
  * GPU parallel runs could not be measured (pmemd.cuda.MPI errors)

^AMBER BENCHMARK EXAMPLES^^^^^^
|JAC_PRODUCTION_NVE - 23,558 atoms PME||||||
|  16 CPU cores  |  1xK20  |  2xK20  |  3xK20  |  4xK20  |  measure  |
|  12.87  |  80.50  |  88.76  |  103.09  |  122.45  |  ns/day  |
|  6713.99  |  1073.23  |  973.45  |  838.09  |  705.61  |  seconds/ns  |
|FACTOR_IX_PRODUCTION_NVE - 90,906 atoms PME||||||
|  16 CPU cores  |  1xK20  |  2xK20  |  3xK20  |  4xK20  |  measure  |
|  3.95  |  22.25  |  27.47  |  32.56  |  39.52  |  ns/day  |
|  21865.59  |  3883.38  |  3145.32  |  2653.65  |  2186.28  |  seconds/ns  |
|CELLULOSE_PRODUCTION_NVE - 408,609 atoms PME||||||
|  16 CPU cores  |  1xK20  |  2xK20  |  3xK20  |  4xK20  |  measure  |
|  0.91  |  5.40  |  6.44  |  7.51  |  8.85  |  ns/day  |
|  95235.87  |  15986.42  |  13406.15  |  11509.28  |  9768.23  |  seconds/ns  |
|NUCLEOSOME_PRODUCTION - 25,095 atoms GB||||||
|  16 CPU cores  |  1xK20  |  2xK20  |  3xK20  |  4xK20  |  measure  |
|  0.06  |  2.79  |  3.65  |  3.98  |  ???  |  ns/day  |
|  1478614.67  |  31007.58  |  23694.29  |  21724.33  |  ???  |  seconds/ns  |

  * 5-6x performance speedups using one GPU versus 16 CPU cores
  * 9-10x performance speedups using four GPUs versus 16 CPU cores
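
The two rows in each benchmark table above report the same measurement in different units: seconds/ns is simply 86,400 seconds per day divided by the ns/day figure. A quick sanity check of that conversion against the JAC numbers above:

<code>
# seconds per ns = 86400 / (ns per day)
echo "scale=2; 86400 / 80.50" | bc   # 1xK20 JAC, ~1073 seconds/ns
echo "scale=2; 86400 / 12.87" | bc   # 16 CPU cores JAC, ~6713 seconds/ns
</code>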

==== Setup ====

First we get some CPU-based data.
  
<code>
# parallel run, note that you will need to create the machinefile
# if -np=4 it would contain 4 lines with the string 'localhost'...does not work, use hostname
mpirun --machinefile=nodefile -np 4 $AMBERHOME/bin/pmemd.MPI \
 -O -i mdin -o mdout -p prmtop \
 ...
</code>
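
A minimal sketch of building that machinefile with the host's own name instead of 'localhost', which is what worked here (the 4-rank count is just an example):

<code>
# one line per MPI rank; use the real hostname, 'localhost' did not work
for i in 1 2 3 4; do hostname; done > nodefile
cat nodefile
</code>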

You may want to try to run your pmemd problem across multiple GPUs if the problem set is large enough.
  
<code>
...
</code>
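
A minimal sketch of such a multi-GPU run, assuming two K20s with device IDs 0 and 1 (the device IDs are an assumption) and the same input files used in the script below:

<code>
# expose two GPUs and start one MPI rank per GPU
export CUDA_VISIBLE_DEVICES="0,1"
mpirun --machinefile=nodefile2 -np 2 $AMBERHOME/bin/pmemd.cuda.MPI \
 -O -i inp/mini.in -p 1g6r.cd.parm \
 -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1
</code>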
  
==== Script ====
  
<code>

[TestDriveUser0@K20-WS]$ cat run
#!/bin/bash
# clean up output from any previous run
rm -rf err out logfile mdout restrt mdinfo

echo CPU serial
pmemd -O -i inp/mini.in -p 1g6r.cd.parm \
 -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
cp mdout 1core.serial.log

echo CPU parallel 2,4,8,16,24,32 /usr/local/mpich2-1.4.1p1/bin/mpirun
for i in 2 4 8 16 24 32
do
echo $i
mpirun --machinefile=nodefile$i -np $i pmemd.MPI -O -i inp/mini.in -p 1g6r.cd.parm \
 -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
cp mdout ${i}core.parallel.log
done

echo GPU serial
# expose only GPU device index 2 to CUDA
export CUDA_VISIBLE_DEVICES="2"
pmemd.cuda -O -i inp/mini.in -p 1g6r.cd.parm \
 -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
cp mdout 1gpu.serial.log

echo GPU parallel 2 /usr/local/mpich2-1.4.1p1/bin/mpirun
# only the 2-rank case is attempted, still against the single exposed GPU
export CUDA_VISIBLE_DEVICES="2"
for i in 2
do
echo $i
mpirun --machinefile=nodefile$i -np $i pmemd.cuda.MPI -O -i inp/mini.in -p 1g6r.cd.parm \
 -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
cp mdout ${i}gpu.parallel.log
done

</code>
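
One way to launch the benchmark script above, assuming it is saved as run in the working directory; the out and err names match the files the script removes at the top:

<code>
chmod +x run
./run > out 2> err &
tail -f out
</code>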
  
\\
**[[cluster:0|Back]]**