===== Amber GPU Testing (EC) =====

We are interested in benchmarking the serial, MPI, cuda and cuda.MPI versions of pmemd.

==== Results ====

  * Verified the MPI threads and GPU invocations (one way to spot-check this is sketched right after this list)
  * Verified the output data
  * pmemd.cuda.MPI errors
  * Script used is listed at the end of this page

^ PMEMD implementation of SANDER, Release 12 ^
| Minimizing the system with 25 kcal/mol restraints on protein, 500 steps of steepest descent and 500 of conjugate gradient - Surjit Dixit problem set |

^ CPU Jobs (1,000 steps) ^ Serial ^ -np 2 ^ -np 4 ^ -np 8 ^ -np 16 ^ -np 24 ^ -np 32 ^
| Wall Time (secs) |

  * MPI speedup near -np 24 is 8x serial

^ GPU Jobs ^ Serial ^ -np 2 ^ -np 4 ^ -np 8 ^ -np 16 ^ -np 24 ^ -np 32 ^
| Wall Time (secs) |

  * GPU serial speedup is 17.5x CPU serial performance and outperforms MPI by at least 2x
  * GPU parallel speedup could not be measured (pmemd.cuda.MPI errors)

^ AMBER BENCHMARK EXAMPLES ^^^^^^
| JAC_PRODUCTION_NVE - 23,558 atoms PME ||||||
| 16 cpu cores | 1xK20 | 2xK20 |
| 12.87 | 80.50 | 88.76 | 103.09 |
| 6713.99 |
| FACTOR_IX_PRODUCTION_NVE - 90,906 atoms PME ||||||
| 16 cpu cores | 1xK20 | 2xK20 |
| 3.95 | 22.25 | 27.47 | 32.56 |
| 21865.59 |
| CELLULOSE_PRODUCTION_NVE - 408,609 atoms PME ||||||
| 16 cpu cores | 1xK20 | 2xK20 |
| 0.91 |
| 95235.87 |
| NUCLEOSOME_PRODUCTION - 25,095 atoms GB ||||||
| 16 cpu cores | 1xK20 | 2xK20 |
| 0.06 |
| 1478614.67 |

  * 5-6x performance speedups using one GPU versus 16 CPU cores (see the worked example below)
  * 9-10x performance speedups using four GPUs versus 16 CPU cores

==== Setup ====

First we get some CPU-based data.
<code>
# parallel run, note that you will need to create the machinefile
# if -np=4 it would contain 4 lines with the string '
mpirun --machinefile=nodefile -np 4 $AMBERHOME/
 -O -i mdin -o mdout -p prmtop \
</code>
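The machinefiles themselves are not shown. A minimal sketch for generating them, assuming all ranks run on this single workstation so that every line is just the local hostname:

<code>
# hypothetical helper: write nodefile2 ... nodefile32 with one hostname line per rank
for n in 2 4 8 16 24 32; do
  rm -f nodefile$n
  for i in $(seq $n); do hostname >> nodefile$n; done
done
</code>
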
==== Script ====
<code>

[TestDriveUser0@K20-WS]$ cat run
#!/bin/bash
# clean up output from any previous run
rm -rf err out logfile mdout restrt mdinfo

echo CPU serial
pmemd -O -i inp/mini.in -p 1g6r.cd.parm \
  -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
cp mdout 1core.serial.log

echo CPU parallel 2,4,8,16 /
for i in 2 4 8 16 24 32
do
  echo $i
  mpirun --machinefile=nodefile$i -np $i pmemd.MPI -O -i inp/mini.in -p 1g6r.cd.parm \
    -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
  cp mdout ${i}core.parallel.log
done

echo GPU serial
# restrict the run to specific GPU device ids
export CUDA_VISIBLE_DEVICES="
pmemd.cuda -O -i inp/mini.in -p 1g6r.cd.parm \
  -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
cp mdout 1gpu.serial.log

echo GPU parallel 2,4,8,16 /
export CUDA_VISIBLE_DEVICES="
# pmemd.cuda.MPI was only attempted with 2 ranks (it errored out, see Results)
for i in 2
do
  echo $i
  mpirun --machinefile=nodefile$i -np $i pmemd.cuda.MPI -O -i inp/mini.in -p 1g6r.cd.parm \
    -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
  cp mdout ${i}gpu.parallel.log
done

</code>
| \\ | \\ | ||
| **[[cluster: | **[[cluster: | ||