  
===== Amber GPU Testing (EC) =====


We are interested in benchmarking the serial, MPI, cuda and cuda.MPI versions of pmemd.
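
Before timing anything it is worth confirming that all four binaries resolve; a minimal check along these lines (assuming $AMBERHOME/bin is already on the PATH, as in the runs below):

<code>
# confirm the four pmemd flavors are on the PATH
which pmemd pmemd.MPI pmemd.cuda pmemd.cuda.MPI
</code>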

==== Results ====

  * Verified the MPI threads and GPU invocations (see the spot checks below)
  * Verified the output data
  * pmemd.cuda.MPI errors out, so parallel GPU timings could not be collected
  * The script used is listed at the end of this page
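
As an illustration of how such checks can be done (a sketch, not necessarily the exact commands used for these runs), the MPI rank count and the GPU activity can be spot-checked while a job runs:

<code>
# count running MPI ranks during a pmemd.MPI job
ps -ef | grep pmemd.MPI | grep -v grep | wc -l

# confirm the GPU is busy during a pmemd.cuda run
nvidia-smi
</code>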

^ PMEMD implementation of SANDER, Release 12 ^
| Minimizing the system with 25 kcal/mol restraints on protein, 500 steps of steepest descent and 500 of conjugate gradient - Surjit Dixit problem set |

^ CPU Jobs (1,000 steps) ^ Serial ^ -np 2 ^ -np 4 ^ -np 8 ^ -np 16 ^ -np 24 ^ -np 32 ^
| Wall Time (secs) |  211  |  120  |  64  |  35  |  29  |  26  |  33  |

  * MPI speedup near -np 24 is about 8x serial

^ GPU Jobs ^ Serial ^ -np 2 ^ -np 4 ^ -np 8 ^ -np 16 ^ -np 24 ^ -np 32 ^
| Wall Time (secs) |  12  |    |    |    |    |    |    |

  * GPU serial speedup is 17.5x the CPU serial performance and outperforms the best MPI run by at least 2x (quick check below)
  * GPU parallel timings could not be measured because pmemd.cuda.MPI errors out
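
These ratios follow directly from the wall times in the two tables above; a quick arithmetic check with bc:

<code>
# speedups from the wall times above (211 s serial, 26 s at -np 24, 12 s on one GPU)
echo "scale=1; 211/26" | bc   # ~8.1x  best MPI vs serial CPU
echo "scale=1; 211/12" | bc   # ~17.5x one GPU vs serial CPU
echo "scale=1; 26/12"  | bc   # ~2.1x  one GPU vs best MPI run
</code>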

^ AMBER BENCHMARK EXAMPLES ^^^^^^
| JAC_PRODUCTION_NVE - 23,558 atoms PME ||||||
|  16 cpu cores  |  1xK20  |  2xK20  |  3xK20  |  4xK20  |  measure  |
|  12.87  |  80.50  |  88.76  |  103.09  |  122.45  |  ns/day  |
|  6713.99  |  1073.23  |  973.45  |  838.09  |  705.61  |  seconds/ns  |
| FACTOR_IX_PRODUCTION_NVE - 90,906 atoms PME ||||||
|  16 cpu cores  |  1xK20  |  2xK20  |  3xK20  |  4xK20  |  measure  |
|  3.95  |  22.25  |  27.47  |  32.56  |  39.52  |  ns/day  |
|  21865.59  |  3883.38  |  3145.32  |  2653.65  |  2186.28  |  seconds/ns  |
| CELLULOSE_PRODUCTION_NVE - 408,609 atoms PME ||||||
|  16 cpu cores  |  1xK20  |  2xK20  |  3xK20  |  4xK20  |  measure  |
|  0.91  |  5.40  |  6.44  |  7.51  |  8.85  |  ns/day  |
|  95235.87  |  15986.42  |  13406.15  |  11509.28  |  9768.23  |  seconds/ns  |
| NUCLEOSOME_PRODUCTION - 25,095 atoms GB ||||||
|  16 cpu cores  |  1xK20  |  2xK20  |  3xK20  |  4xK20  |  measure  |
|  0.06  |  2.79  |  3.65  |  3.98  |  ???  |  ns/day  |
|  1478614.67  |  31007.58  |  23694.29  |  21724.33  |  ???  |  seconds/ns  |


  * 5-6x performance speedup using one GPU versus 16 CPU cores
  * 9-10x performance speedup using four GPUs versus 16 CPU cores (see the check below)
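
The headline speedups come straight from the ns/day rows; for example, for the JAC benchmark:

<code>
# JAC_PRODUCTION_NVE speedups over 16 CPU cores, from the ns/day row above
echo "scale=1; 80.50/12.87"  | bc   # ~6.2x with one K20
echo "scale=1; 122.45/12.87" | bc   # ~9.5x with four K20s
</code>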


==== Setup ====

First we get some CPU-based data.

<code>

# serial run of pmemd
nohup $AMBERHOME/bin/pmemd -O -i mdin -o mdout -p prmtop \
-c inpcrd -r restrt -x mdcrd </dev/null &

# parallel run, note that you will need to create the machinefile
# for -np 4 it would contain 4 lines with the hostname ('localhost' does not work)
mpirun --machinefile=nodefile -np 4 $AMBERHOME/bin/pmemd.MPI \
-O -i mdin -o mdout -p prmtop \
-c inpcrd -r restrt -x mdcrd </dev/null &

</code>
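
For example, a machinefile for the -np 4 run above can be generated on the node itself (a minimal sketch; the file name nodefile matches the --machinefile flag used above):

<code>
# build a machinefile with one hostname entry per MPI rank (4 here)
rm -f nodefile
for i in 1 2 3 4; do hostname >> nodefile; done
cat nodefile
</code>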
  
  
The following script should be in your path ... located in ~/bin

You need to allocate one or more GPUs for your cuda runs.
  
<code>
# serial run of pmemd.cuda
nohup $AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout -p prmtop \
-c inpcrd -r restrt -x mdcrd </dev/null &

# parallel run, note that you will need to create the machinefile
# (pmemd.cuda.MPI is launched via mpirun in the same way as pmemd.MPI above)

</code>
  
  
You may want to run your pmemd problem across multiple GPUs if the problem set is large enough.

<code>

# expose multiple GPUs (for serial or parallel runs)
export CUDA_VISIBLE_DEVICES="0,2"

</code>
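
To decide which device ids to expose, nvidia-smi lists the installed GPUs and their current utilization (a generic check, not specific to this test node):

<code>
# list installed GPUs with their device ids, memory use and utilization,
# then pick the ids to place in CUDA_VISIBLE_DEVICES
nvidia-smi
</code>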

==== Script ====

<code>

[TestDriveUser0@K20-WS]$ cat run
#!/bin/bash
rm -rf err out logfile mdout restrt mdinfo

echo CPU serial
pmemd -O -i inp/mini.in -p 1g6r.cd.parm \
 -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
cp mdout 1core.serial.log

echo CPU parallel 2,4,8,16,24,32 /usr/local/mpich2-1.4.1p1/bin/mpirun
for i in 2 4 8 16 24 32
do
echo $i
mpirun --machinefile=nodefile$i -np $i pmemd.MPI -O -i inp/mini.in -p 1g6r.cd.parm \
 -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
cp mdout ${i}core.parallel.log
done

echo GPU serial
export CUDA_VISIBLE_DEVICES="2"
pmemd.cuda -O -i inp/mini.in -p 1g6r.cd.parm \
 -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
cp mdout 1gpu.serial.log

echo GPU parallel 2 /usr/local/mpich2-1.4.1p1/bin/mpirun
export CUDA_VISIBLE_DEVICES="2"
for i in 2
do
echo $i
mpirun --machinefile=nodefile$i -np $i pmemd.cuda.MPI -O -i inp/mini.in -p 1g6r.cd.parm \
 -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
cp mdout ${i}gpu.parallel.log
done

</code>
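
To reproduce the runs, the driver can be started in the background along these lines (a sketch; the out and err file names are assumptions chosen to match what the script removes at startup):

<code>
# launch the benchmark driver and follow its progress
chmod +x run
nohup ./run > out 2> err &
tail -f out
</code>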
  
\\
**[[cluster:0|Back]]**