===== Amber GPU Testing (EC) =====

We are interested in benchmarking the serial, MPI, CUDA, and CUDA MPI versions of pmemd (pmemd, pmemd.MPI, pmemd.cuda and pmemd.cuda.MPI).
+ | |||
+ | ==== Results ==== | ||
+ | |||
+ | * Verified the MPI threads and GPU invocations | ||
+ | * Verified the output data | ||
+ | * pmemd.cuda.MPI errors | ||
+ | * Script used is listed at end of this page | ||
+ | |||
+ | ^ PMEMD implementation of SANDER, Release 12 ^ | ||
+ | |Minimzing the system with 25 kcal/mol restraints on protein, 500 steps of steepest descent and 500 of conjugated gradient - Surjit Dixit problem set| | ||
+ | |||
+ | ^CPU Jobs (1,000 steps)^ Serial ^ -np 2 ^ -np 4 ^ -np 8 ^ -np 16 ^ -np 24 ^ -np 32 ^ | ||
+ | |Wall Time (secs)| | ||
+ | |||
+ | * MPI speedup near -np 24 is 8x serial | ||
+ | |||
+ | ^GPU Jobs^ Serial ^ -np 2 ^ -np 4 ^ -np 8 ^ -np 16 ^ -np 24 ^ -np 32 ^ | ||
+ | |Wall Time (secs)| | ||
+ | |||
+ | * GPU serial speedup is 17.5x CPU serial performance and outperforms MPI by at least 2x | ||
+ | * GPU parallel unable to measure | ||
+ | |||
+ | ^AMBER BENCHMARK EXAMPLES^^^^^^ | ||
+ | |JAC_PRODUCTION_NVE - 23,558 atoms PME|||||| | ||
+ | | 16 cpu cores | 1xK20 | 2xK20 | ||
+ | | 12.87 | 80.50 | 88.76 | 103.09 | ||
+ | | 6713.99 | ||
+ | |FACTOR_IX_PRODUCTION_NVE - 90,906 atoms PME|||||| | ||
+ | | 16 cpu cores | 1xK20 | 2xK20 | ||
+ | | 3.95 | 22.25 | 27.47 | 32.56 | ||
+ | | 21865.59 | ||
+ | |CELLULOSE_PRODUCTION_NVE - 408,609 atoms PME|||||| | ||
+ | | 16 cpu cores | 1xK20 | 2xK20 | ||
+ | | 0.91 | ||
+ | | 95235.87 | ||
+ | |||
+ | |||
+ | * 5-6x performance speed ups using one GPU versus 16 CPU cores | ||
+ | * 9-10x perrformance speedups using four GPUs versus 16 CPU cores | ||
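
For reference, the 5-6x figure follows directly from the single-GPU numbers in the benchmark table above (assuming larger numbers are better, i.e. ns/day style throughput):

<code>
one K20 versus 16 CPU cores
JAC_PRODUCTION_NVE:        80.50 / 12.87  ~  6.3x
FACTOR_IX_PRODUCTION_NVE:  22.25 /  3.95  ~  5.6x
</code>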
+ | |||
+ | |||
+ | ==== Setup ==== | ||
+ | |||
+ | First we get some CPU based data. | ||
+ | |||
+ | < | ||
+ | |||
+ | # serial run of pmemd | ||
+ | nohup $AMBERHOME/ | ||
+ | -c inpcrd -r restrt -x mdcrd </ | ||
+ | |||
+ | # parallel run, note that you will need create the machinefile | ||
+ | # if -np=4 it would would contain 4 lines with the string ' | ||
+ | mpirun --machinefile=nodefile -np 4 $AMBERHOME/ | ||
+ | -O -i mdin -o mdout -p prmtop \ | ||
+ | -c inpcrd -r restrt -x mdcrd </ | ||
+ | |||
+ | </ | ||
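
The machinefile (the nodefile above) simply lists one hostname per MPI rank. For -np 4 on a single host it might look like this (the hostname node2 is only an illustration):

<code>
node2$ cat nodefile
node2
node2
node2
node2
</code>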
The following script should be in your path ... located in ~/bin

You need to allocate one or more GPUs for your CUDA runs.
<code>
node2$ gpu-info
====================================================
Device  Model          Temperature   Utilization
====================================================
0       Tesla C2070    83 C           0 %
1       Tesla C2070    86 C           0 %
2       Tesla C2070    88 C          99 %
3       Tesla C2070    87 C          99 %
====================================================
</code>
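
gpu-info itself is just a small local helper script; if you need something similar on another machine, a minimal sketch built on nvidia-smi (assuming the installed driver's nvidia-smi supports the --query-gpu option) could look like this:

<code>
#!/bin/bash
# rough gpu-info equivalent: id, model, temperature and utilization per GPU
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu \
           --format=csv,noheader | \
  awk -F', ' '{ printf "%-6s %-16s %4s C %8s\n", $1, $2, $3, $4 }'
</code>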
+ | |||
+ | Next we need to expose these GPUs to pmemd ... | ||
+ | |||
+ | < | ||
+ | |||
+ | # expose one | ||
+ | export CUDA_VISIBLE_DEVICES=" | ||
+ | |||
+ | # serial run of pmemd.cuda | ||
+ | nohup $AMBERHOME/ | ||
+ | -c inpcrd -r restrt -x mdcrd </ | ||
+ | |||
+ | # parallel run, note that you will need create the machinefile | ||
+ | # if -np=4 it would could contain 4 lines with the string ' | ||
+ | mpirun --machinefile=nodefile -np 4 $AMBERHOME/ | ||
+ | -O -i mdin -o mdout -p prmtop \ | ||
+ | -c inpcrd -r restrt -x mdcrd </ | ||
+ | |||
+ | </ | ||
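
Note that the IDs passed to CUDA_VISIBLE_DEVICES are the physical device numbers reported by gpu-info; inside the job the exposed devices are renumbered starting at 0. For example (the device number is only an illustration):

<code>
# expose only physical device 2; pmemd.cuda will report it as device 0
export CUDA_VISIBLE_DEVICES="2"
</code>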
+ | |||
+ | |||
+ | You may want to try to run your pmemd problem across multiple GPUs if problem set is large enough. | ||
+ | |||
+ | < | ||
+ | |||
+ | # expose multiple (for serial or parallel runs) | ||
+ | export CUDA_VISIBLE_DEVICES=" | ||
+ | |||
+ | </ | ||
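
One way to verify that the MPI ranks actually started and that the exposed GPUs are being used (this is just a sanity check, not part of the benchmark itself):

<code>
# the number of running pmemd.cuda.MPI ranks should match -np
ps -ef | grep [p]memd.cuda.MPI | wc -l

# the exposed GPUs should show non-zero utilization while the job runs
gpu-info
</code>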
+ | |||
+ | ==== Script ==== | ||
+ | |||
+ | < | ||
+ | |||
+ | [TestDriveUser0@K20-WS]$ cat run | ||
+ | #!/bin/bash | ||
+ | rm -rf err out logfile mdout restrt mdinfo | ||
+ | |||
+ | echo CPU serial | ||
+ | pmemd -O -i inp/mini.in -p 1g6r.cd.parm \ | ||
+ | -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1 | ||
+ | cp mdout 1core.serial.log | ||
+ | |||
+ | echo CPU parallel 2,4,8,16 / | ||
+ | for i in 2 4 8 16 24 32 | ||
+ | do | ||
+ | echo $i | ||
+ | mpirun --machinefile=nodefile$i -np $i pmemd.MPI -O -i inp/mini.in -p 1g6r.cd.parm \ | ||
+ | -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1 | ||
+ | cp mdout ${i}core.parallel.log | ||
+ | done | ||
+ | |||
+ | echo GPU serial | ||
+ | export CUDA_VISIBLE_DEVICES=" | ||
+ | pmemd.cuda -O -i inp/mini.in -p 1g6r.cd.parm \ | ||
+ | -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1 | ||
+ | cp mdout 1gpu.serial.log | ||
+ | |||
+ | echo GPU parallel 2,4,8,16 / | ||
+ | export CUDA_VISIBLE_DEVICES=" | ||
+ | for i in 2 | ||
+ | do | ||
+ | echo $i | ||
+ | mpirun --machinefile=nodefile$i -np $i pmemd.cuda.MPI -O -i inp/mini.in -p 1g6r.cd.parm \ | ||
+ | -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1 | ||
+ | cp mdout ${i}gpu.parallel.log | ||
+ | done | ||
+ | |||
+ | </ | ||
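
The Wall Time numbers reported in the Results section can be pulled out of the saved mdout copies afterwards; something along these lines should work, although the exact timing labels in mdout differ a bit between the serial and MPI binaries:

<code>
# show the final timing summary lines from every saved log
grep -H -i "total.*time" *.log
</code>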
\\
**[[cluster: