===== Amber GPU Testing (EC) =====

We are interested in benchmarking the serial, MPI, cuda and cuda.MPI versions of pmemd.

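For reference, the four pmemd flavors map onto these binaries (the full invocations used here are listed in the Script section at the end of this page):

<code>
pmemd           # serial, CPU only
pmemd.MPI       # parallel, CPU cores via mpirun
pmemd.cuda      # serial, single GPU
pmemd.cuda.MPI  # parallel, multiple GPUs via mpirun
</code>
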
==== Results ====

  * Verified the MPI threads and GPU invocations (see the sketch after this list)
  * Verified the output data
  * pmemd.cuda.MPI errors out, so no parallel GPU timings are reported
  * The script used is listed at the end of this page

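A minimal sketch of the kind of checks behind the first two bullets, assuming standard Linux/CUDA tools and the log file names produced by the script at the end of this page (the "FINAL RESULTS" marker is what the minimization output is expected to print; adjust to your mdout):

<code>
# count the pmemd.MPI ranks actually running on the node
ps -ef | grep pmemd.MPI | grep -v grep | wc -l

# watch GPU utilization while pmemd.cuda is running
nvidia-smi

# compare the final minimization energies between serial and parallel runs
grep -A 4 "FINAL RESULTS" 1core.serial.log 16core.parallel.log
</code>
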
^ PMEMD implementation of SANDER, Release 12  ^
|Minimizing the system with 25 kcal/mol restraints on the protein, 500 steps of steepest descent and 500 steps of conjugate gradient - Surjit Dixit problem set|

^CPU Jobs (1,000 steps)^ Serial ^ -np 2 ^ -np 4 ^ -np 8 ^ -np 16 ^ -np 24 ^ -np 32 ^
|Wall Time (secs)|  211  |  120  |  64  |  35  |  29  |  26  |  33  |

  * MPI speedup peaks near -np 24 at roughly 8x serial

^GPU Jobs^ Serial ^ -np 2 ^ -np 4 ^ -np 8 ^ -np 16 ^ -np 24 ^ -np 32 ^
|Wall Time (secs)|  12  |  |  |  |  |  |  |

  * GPU serial speedup is 17.5x the CPU serial run and outperforms the best MPI run by at least 2x (see the quick check after this list)
  * GPU parallel runs could not be measured because pmemd.cuda.MPI errored out

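A quick check of the speedup figures, dividing the 211-second serial wall time by the other wall times from the tables above (plain bc arithmetic, nothing Amber-specific):

<code>
# CPU -np 2..32 followed by the GPU serial run
for t in 120 64 35 29 26 33 12; do
  echo -n "$t sec -> "; echo "scale=1; 211/$t" | bc
done
# -np 24 gives 8.1x, the single GPU gives 17.5x
# (and 26/12 confirms the 2x margin over the best MPI run)
</code>
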
^ AMBER BENCHMARK EXAMPLES ^^^^^^
^ JAC_PRODUCTION_NVE - 23,558 atoms PME ^^^^^^
^  16 cpu cores  ^  1xK20  ^  2xK20  ^  3xK20  ^  4xK20  ^  measure  ^
|  12.87  |  80.50  |  88.76  |  103.09  |  122.45  |  ns/day  |
|  6713.99  |  1073.23  |  973.45  |  838.09  |  705.61  |  seconds/ns  |
^ FACTOR_IX_PRODUCTION_NVE - 90,906 atoms PME ^^^^^^
^  16 cpu cores  ^  1xK20  ^  2xK20  ^  3xK20  ^  4xK20  ^  measure  ^
|  3.95  |  22.25  |  27.47  |  32.56  |  39.52  |  ns/day  |
|  21865.59  |  3883.38  |  3145.32  |  2653.65  |  2186.28  |  seconds/ns  |
^ CELLULOSE_PRODUCTION_NVE - 408,609 atoms PME ^^^^^^
^  16 cpu cores  ^  1xK20  ^  2xK20  ^  3xK20  ^  4xK20  ^  measure  ^
|  0.91  |  5.40  |  6.44  |  7.51  |  8.85  |  ns/day  |
|  95235.87  |  15986.42  |  13406.15  |  11509.28  |  9768.23  |  seconds/ns  |
^ NUCLEOSOME_PRODUCTION - 25,095 atoms GB ^^^^^^
^  16 cpu cores  ^  1xK20  ^  2xK20  ^  3xK20  ^  4xK20  ^  measure  ^
|  0.06  |  2.79  |  3.65  |  3.98  |  ???  |  ns/day  |
|  1478614.67  |  31007.58  |  23694.29  |  21724.33  |  ???  |  seconds/ns  |


  * 5-6x performance speedups using one GPU versus 16 CPU cores (PME benchmarks)
  * 9-10x performance speedups using four GPUs versus 16 CPU cores (see the ratio check below)

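The ratios quoted above come straight from the ns/day rows of the PME benchmarks; a quick bc check using only the table values:

<code>
# 1xK20 vs 16 cores, then 4xK20 vs 16 cores, for JAC, FACTOR_IX and CELLULOSE
for pair in 80.50/12.87 22.25/3.95 5.40/0.91 122.45/12.87 39.52/3.95 8.85/0.91; do
  echo -n "$pair = "; echo "scale=1; $pair" | bc
done
# 1xK20: 6.2x 5.6x 5.9x   4xK20: 9.5x 10.0x 9.7x
</code>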

==== Setup ====
  
First we get some CPU based data.

<code>
  
# parallel run, note that you will need to create the machinefile
# if -np=4 it would contain 4 lines; the string 'localhost' does not work, use the hostname
mpirun --machinefile=nodefile -np 4 $AMBERHOME/bin/pmemd.MPI \
 -O -i mdin -o mdout -p prmtop \
</code>
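
A minimal sketch of how such a machinefile could be generated for -np 4, following the note above about using the real hostname rather than 'localhost' (the file name nodefile matches the --machinefile flag):

<code>
# one line per MPI rank, all pointing at this host
for i in 1 2 3 4; do hostname; done > nodefile
cat nodefile
</code>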
  
==== Script ====

<code>

[TestDriveUser0@K20-WS]$ cat run
#!/bin/bash
rm -rf err out logfile mdout restrt mdinfo

echo CPU serial
pmemd -O -i inp/mini.in -p 1g6r.cd.parm \
 -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
cp mdout 1core.serial.log

echo CPU parallel 2,4,8,16,24,32 /usr/local/mpich2-1.4.1p1/bin/mpirun
for i in 2 4 8 16 24 32
do
echo $i
mpirun --machinefile=nodefile$i -np $i pmemd.MPI -O -i inp/mini.in -p 1g6r.cd.parm \
 -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
cp mdout ${i}core.parallel.log
done

echo GPU serial
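# expose only the GPU with device id 2 to the run below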
export CUDA_VISIBLE_DEVICES="2"
pmemd.cuda -O -i inp/mini.in -p 1g6r.cd.parm \
 -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
cp mdout 1gpu.serial.log

echo GPU parallel 2,4,8,16 /usr/local/mpich2-1.4.1p1/bin/mpirun
export CUDA_VISIBLE_DEVICES="2"
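# only -np 2 was attempted; the pmemd.cuda.MPI runs errored out (see Results above)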
for i in 2
do
echo $i
mpirun --machinefile=nodefile$i -np $i pmemd.cuda.MPI -O -i inp/mini.in -p 1g6r.cd.parm \
 -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
cp mdout ${i}gpu.parallel.log
done

</code>
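
A possible way to run the script and collect the wall times quoted in the tables above (the exact timing label in the copied mdout logs may differ by Amber version, hence the case-insensitive grep):

<code>
chmod +x run && ./run
grep -i "wall time" *.log
</code>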
  
\\
**[[cluster:0|Back]]**