\\ **[[cluster:0|Back]]**

===== Amber GPU Testing (EC) =====

We are interested in benchmarking the serial, MPI, cuda, and cuda.MPI versions of pmemd.

==== Results ====

  * Verified the MPI thread counts and GPU invocations
  * Verified the output data
  * pmemd.cuda.MPI runs produced errors, so GPU parallel times could not be measured
  * The script used is listed at the end of this page

^ PMEMD implementation of SANDER, Release 12 ^
| Minimizing the system with 25 kcal/mol restraints on the protein, 500 steps of steepest descent and 500 of conjugate gradient - Surjit Dixit problem set |

^ CPU Jobs (1,000 steps) ^ Serial ^ -np 2 ^ -np 4 ^ -np 8 ^ -np 16 ^ -np 24 ^ -np 32 ^
| Wall Time (secs) | 211 | 120 | 64 | 35 | 29 | 26 | 33 |

  * MPI speedup peaks near -np 24 at roughly 8x serial

^ GPU Jobs ^ Serial ^ -np 2 ^ -np 4 ^ -np 8 ^ -np 16 ^ -np 24 ^ -np 32 ^
| Wall Time (secs) | 12 | | | | | | |

  * GPU serial speedup is 17.5x CPU serial performance and outperforms the best MPI result by at least 2x
  * GPU parallel could not be measured (pmemd.cuda.MPI errors)

^ AMBER BENCHMARK EXAMPLES ^^^^^^
^ JAC_PRODUCTION_NVE - 23,558 atoms, PME ^^^^^^
^ 16 CPU cores ^ 1x K20 ^ 2x K20 ^ 3x K20 ^ 4x K20 ^ Measure ^
| 12.87 | 80.50 | 88.76 | 103.09 | 122.45 | ns/day |
| 6713.99 | 1073.23 | 973.45 | 838.09 | 705.61 | seconds/ns |

^ FACTOR_IX_PRODUCTION_NVE - 90,906 atoms, PME ^^^^^^
^ 16 CPU cores ^ 1x K20 ^ 2x K20 ^ 3x K20 ^ 4x K20 ^ Measure ^
| 3.95 | 22.25 | 27.47 | 32.56 | 39.52 | ns/day |
| 21865.59 | 3883.38 | 3145.32 | 2653.65 | 2186.28 | seconds/ns |

^ CELLULOSE_PRODUCTION_NVE - 408,609 atoms, PME ^^^^^^
^ 16 CPU cores ^ 1x K20 ^ 2x K20 ^ 3x K20 ^ 4x K20 ^ Measure ^
| 0.91 | 5.40 | 6.44 | 7.51 | 8.85 | ns/day |
| 95235.87 | 15986.42 | 13406.15 | 11509.28 | 9768.23 | seconds/ns |

^ NUCLEOSOME_PRODUCTION - 25,095 atoms, GB ^^^^^^
^ 16 CPU cores ^ 1x K20 ^ 2x K20 ^ 3x K20 ^ 4x K20 ^ Measure ^
| 0.06 | 2.79 | 3.65 | 3.98 | ??? | ns/day |
| 1478614.67 | 31007.58 | 23694.29 | 21724.33 | ??? | seconds/ns |

  * 5-6x performance speedup using one GPU versus 16 CPU cores
  * 9-10x performance speedup using four GPUs versus 16 CPU cores

==== Setup ====

First we gather some CPU-based data.

<code bash>
# serial run of pmemd
nohup $AMBERHOME/bin/pmemd -O -i mdin -o mdout -p prmtop \
  -c inpcrd -r restrt -x mdcrd
</code>

The gpu-info script shown below should be in your path (it is located in ~/bin). You need to allocate one or more GPUs for your cuda runs.

<code>
node2$ gpu-info
====================================================
Device  Model       Temperature  Utilization
====================================================
0       Tesla K20   27 C         0 %
1       Tesla K20   28 C         0 %
2       Tesla K20   27 C         0 %
3       Tesla K20   30 C         0 %
====================================================
</code>

Next we need to expose these GPUs to pmemd:

<code bash>
# expose one GPU
export CUDA_VISIBLE_DEVICES="0"

# serial run of pmemd.cuda
nohup $AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout -p prmtop \
  -c inpcrd -r restrt -x mdcrd
</code>
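If you script your runs, you can select an idle device automatically instead of reading gpu-info by eye. Below is a minimal sketch, not part of the original test procedure, assuming the stock nvidia-smi utility (installed with the NVIDIA driver) is on your path; it exports the first GPU reporting 0% utilization.

<code bash>
# Hypothetical helper (assumes nvidia-smi is available): query per-GPU
# utilization, pick the first idle device, and expose only that one.
idle_gpu=$(nvidia-smi --query-gpu=index,utilization.gpu \
                      --format=csv,noheader,nounits \
           | awk -F', ' '$2 == 0 {print $1; exit}')
export CUDA_VISIBLE_DEVICES="$idle_gpu"
echo "Using GPU $CUDA_VISIBLE_DEVICES"
</code>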
You may want to run your pmemd problem across multiple GPUs if the problem set is large enough.

<code bash>
# expose multiple GPUs (for serial or parallel runs)
export CUDA_VISIBLE_DEVICES="0,2"
</code>

==== Script ====

<code bash>
[TestDriveUser0@K20-WS]$ cat run
#!/bin/bash

# benchmark driver: serial CPU, parallel CPU, serial GPU, and parallel GPU
# runs of pmemd; each run's mdout is copied to a descriptive log name
rm -rf err out logfile mdout restrt mdinfo

echo CPU serial
pmemd -O -i inp/mini.in -p 1g6r.cd.parm \
  -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
cp mdout 1core.serial.log

echo CPU parallel 2,4,8,16,24,32
# mpirun is /usr/local/mpich2-1.4.1p1/bin/mpirun
for i in 2 4 8 16 24 32
do
  echo $i
  mpirun --machinefile=nodefile$i -np $i \
    pmemd.MPI -O -i inp/mini.in -p 1g6r.cd.parm \
    -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
  cp mdout ${i}core.parallel.log
done

echo GPU serial
export CUDA_VISIBLE_DEVICES="2"
pmemd.cuda -O -i inp/mini.in -p 1g6r.cd.parm \
  -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
cp mdout 1gpu.serial.log

echo GPU parallel 2
export CUDA_VISIBLE_DEVICES="2"
for i in 2
do
  echo $i
  mpirun --machinefile=nodefile$i -np $i \
    pmemd.cuda.MPI -O -i inp/mini.in -p 1g6r.cd.parm \
    -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
  cp mdout ${i}gpu.parallel.log
done
</code>
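Once the runs finish, the wall times reported in the Results tables above can be pulled from the saved logs. A minimal sketch, assuming the TIMINGS section that pmemd writes at the end of mdout contains a line matching "wall time" (the *.log names are the ones the run script copies):

<code bash>
# Summarize wall-clock time per run; relies on grep finding the
# "wall time" line in pmemd's closing TIMINGS section.
for log in *.log; do
    printf '%-24s' "$log"
    grep -i 'wall time' "$log" | tail -1
done
</code>

\\ **[[cluster:0|Back]]**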