
Amber GPU Testing (EC)

We are interested in benchmarking the serial, MPI, cuda and cuda.MPI versions of pmemd.

Results

Verified the MPI threads and GPU invocations
Verified the output data
pmemd.cuda.MPI runs ended in errors
The script used is listed at the end of this page

PMEMD implementation of SANDER, Release 12
Minimizing the system with 25 kcal/mol restraints on the protein, 500 steps of steepest descent and 500 of conjugate gradient - Surjit Dixit problem set

CPU Jobs (1,000 steps)	Serial	-np 2	-np 4	-np 8	-np 16	-np 24	-np 32
Wall Time (secs)	211	120	64	35	29	26	33

MPI speedup peaks at -np 24 at about 8x serial (211 s down to 26 s); at -np 32 the wall time rises again to 33 s
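
The speedup quoted above is the serial wall time divided by the parallel wall time. A minimal sketch of that arithmetic, using the wall times from the CPU table. The function names `speedup` and `efficiency` are illustrative, and the parallel-efficiency figure (speedup divided by process count) is an addition that this page does not report:

```shell
# Wall times (seconds) copied from the CPU table; "NP:SECONDS" pairs.
cpu_times="2:120 4:64 8:35 16:29 24:26 32:33"
serial_secs=211

# speedup SERIAL_SECS RUN_SECS -> serial/run, two decimals
speedup() {
    awk -v s="$1" -v t="$2" 'BEGIN { printf "%.2f\n", s / t }'
}

# efficiency SERIAL_SECS RUN_SECS NP -> speedup divided by process count
efficiency() {
    awk -v s="$1" -v t="$2" -v n="$3" 'BEGIN { printf "%.2f\n", s / t / n }'
}

for entry in $cpu_times; do
    np=${entry%%:*}      # text before the colon
    secs=${entry##*:}    # text after the colon
    echo "np=$np speedup=$(speedup "$serial_secs" "$secs") efficiency=$(efficiency "$serial_secs" "$secs" "$np")"
done
```

With these numbers the speedup tops out at 8.12 for -np 24 and falls to 6.39 for -np 32, while efficiency drops from 0.88 at -np 2 to 0.34 at -np 24.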

GPU Jobs	Serial	-np 2	-np 4	-np 8	-np 16	-np 24	-np 32
Wall Time (secs)	12

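The serial GPU run is about 17.5x faster than the serial CPU run (211 s vs 12 s) and about 2x faster than the best MPI run (26 s at -np 24)
GPU parallel runs could not be measured because pmemd.cuda.MPI ended in errors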

AMBER BENCHMARK EXAMPLES
JAC_PRODUCTION_NVE - 23,558 atoms PME
16 cpu cores	1xK20	2xK20	3xK20	4xK20	measure
12.87	80.50	88.76	103.09	122.45	ns/day
6713.99	1073.23	973.45	838.09	705.61	seconds/ns
FACTOR_IX_PRODUCTION_NVE - 90,906 atoms PME
16 cpu cores	1xK20	2xK20	3xK20	4xK20	measure
3.95	22.25	27.47	32.56	39.52	ns/day
21865.59	3883.38	3145.32	2653.65	2186.28	seconds/ns
CELLULOSE_PRODUCTION_NVE - 408,609 atoms PME
16 cpu cores	1xK20	2xK20	3xK20	4xK20	measure
0.91	5.40	6.44	7.51	8.85	ns/day
95235.87	15986.42	13406.15	11509.28	9768.23	seconds/ns
NUCLEOSOME_PRODUCTION - 25,095 atoms GB
16 cpu cores	1xK20	2xK20	3xK20	4xK20	measure
0.06	2.79	3.65	3.98	???	ns/day
1478614.67	31007.58	23694.29	21724.33	???	seconds/ns

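On the three PME benchmarks, one K20 is about 5.6-6.3x faster than 16 CPU cores
On the same benchmarks, four K20s are about 9.5-10x faster than 16 CPU cores
The GB nucleosome benchmark gains far more: about 48x with one K20, going by the seconds/ns figures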
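
The GPU-versus-CPU factors follow from dividing the ns/day columns, and seconds/ns is 86400 divided by ns/day. A minimal sketch of both calculations for the three PME benchmarks, using values copied from the tables. The helper names `ratio` and `sec_per_ns` are illustrative:

```shell
# ratio A B -> A/B, two decimals (for example GPU ns/day over CPU ns/day)
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'; }

# sec_per_ns NS_PER_DAY -> seconds of wall time per simulated nanosecond
sec_per_ns() { awk -v r="$1" 'BEGIN { printf "%.2f\n", 86400 / r }'; }

# Columns: benchmark, ns/day on 16 CPU cores, on 1xK20, on 4xK20
while read -r name cpu one four; do
    echo "$name 1xK20=$(ratio "$one" "$cpu")x 4xK20=$(ratio "$four" "$cpu")x"
done <<'EOF'
JAC 12.87 80.50 122.45
FACTOR_IX 3.95 22.25 39.52
CELLULOSE 0.91 5.40 8.85
EOF
```

Recomputing seconds/ns this way gives values slightly different from the table rows (for example `sec_per_ns 80.50` yields 1073.29 against the listed 1073.23), presumably because the printed ns/day values are rounded.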

Setup

First we get some CPU based data.

# serial run of pmemd
nohup $AMBERHOME/bin/pmemd -O -i mdin -o mdout -p prmtop \
-c inpcrd -r restrt -x mdcrd </dev/null &

# parallel run; note that you will need to create the machinefile
# for -np 4 it would contain 4 lines with the node's hostname (the string 'localhost' does not work)
mpirun --machinefile=nodefile -np 4 $AMBERHOME/bin/pmemd.MPI \
-O -i mdin -o mdout -p prmtop \
-c inpcrd -r restrt -x mdcrd </dev/null &
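
The machinefile described in the comments above can be generated rather than typed by hand. A minimal sketch, assuming a single-node run where every line holds the local node name. The function name `make_nodefile` is hypothetical, and `uname -n` is used here as a portable way to get the name that `hostname` typically prints:

```shell
# make_nodefile NP [FILE] -> write NP lines, each holding this node's name.
# The body runs in a subshell so its variables do not leak into the caller.
make_nodefile() (
    np=$1
    file=${2:-nodefile}
    host=$(uname -n)     # node name, used instead of 'localhost'
    : > "$file"          # create the file or truncate an old one
    i=0
    while [ "$i" -lt "$np" ]; do
        echo "$host" >> "$file"
        i=$((i + 1))
    done
)
```

For the `-np 4` run above this would be called as `make_nodefile 4 nodefile`. The benchmark script at the end of this page expects one file per process count, which could be produced with `make_nodefile "$i" "nodefile$i"` inside its loop.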

The following script should be in your path … located in ~/bin

You need to allocate one or more GPUs for your cuda runs.

node2$ gpu-info
====================================================
Device  Model           Temperature     Utilization
====================================================
0       Tesla K20       27 C             0 %
1       Tesla K20       28 C             0 %
2       Tesla K20       27 C             0 %
3       Tesla K20       30 C             0 %
====================================================

Next we need to expose these GPUs to pmemd …

# expose one
export CUDA_VISIBLE_DEVICES="0"

# serial run of pmemd.cuda
nohup $AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout -p prmtop \
-c inpcrd -r restrt -x mdcrd </dev/null &

# parallel run; note that you will need to create the machinefile
# for -np 4 it would contain 4 lines with the node's hostname (see the CPU example above)
mpirun --machinefile=nodefile -np 4 $AMBERHOME/bin/pmemd.cuda.MPI \
-O -i mdin -o mdout -p prmtop \
-c inpcrd -r restrt -x mdcrd </dev/null &

If your problem set is large enough, you may want to run your pmemd problem across multiple GPUs.

# expose multiple (for serial or parallel runs)
export CUDA_VISIBLE_DEVICES="0,2"
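
When several GPUs are exposed, the `-np` value for pmemd.cuda.MPI has to be chosen to match. A minimal sketch that derives it from `CUDA_VISIBLE_DEVICES`, assuming one MPI process per exposed GPU. That pairing is a common convention and an assumption here, not something this page establishes, and the function name `count_visible_gpus` is hypothetical:

```shell
# count_visible_gpus LIST -> number of comma-separated device ids in LIST
count_visible_gpus() {
    if [ -z "$1" ]; then
        echo 0
        return
    fi
    echo "$1" | awk -F, '{ print NF }'
}

export CUDA_VISIBLE_DEVICES="0,2"
np=$(count_visible_gpus "$CUDA_VISIBLE_DEVICES")
echo "exposed GPUs: $np"

# Assumed usage, one process per exposed GPU:
# mpirun --machinefile=nodefile -np "$np" $AMBERHOME/bin/pmemd.cuda.MPI \
#   -O -i mdin -o mdout -p prmtop -c inpcrd -r restrt -x mdcrd
```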

Script

[TestDriveUser0@K20-WS]$ cat run
#!/bin/bash
rm -rf err out logfile mdout restrt mdinfo

echo CPU serial
pmemd -O -i inp/mini.in -p 1g6r.cd.parm \
 -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
cp mdout 1core.serial.log

echo CPU parallel 2,4,8,16 /usr/local/mpich2-1.4.1p1/bin/mpirun
for i in 2 4 8 16 24 32
do
echo $i
mpirun --machinefile=nodefile$i -np $i pmemd.MPI -O -i inp/mini.in -p 1g6r.cd.parm \
 -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
cp mdout ${i}core.parallel.log 
done

echo GPU serial
export CUDA_VISIBLE_DEVICES="2"
pmemd.cuda -O -i inp/mini.in -p 1g6r.cd.parm \
 -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
cp mdout 1gpu.serial.log

echo GPU parallel 2,4,8,16 /usr/local/mpich2-1.4.1p1/bin/mpirun
export CUDA_VISIBLE_DEVICES="2"
for i in 2
do
echo $i
mpirun --machinefile=nodefile$i -np $i pmemd.cuda.MPI -O -i inp/mini.in -p 1g6r.cd.parm \
 -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
cp mdout ${i}gpu.parallel.log 
done
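
The script above does not record wall times itself. A minimal sketch of one way to capture them per run, with one-second resolution. This is an assumption about how timing could be done, not the method used for the tables on this page, and the function name `time_run` is hypothetical:

```shell
# time_run LABEL CMD [ARGS...]
# Runs CMD, saves its stdout and stderr in LABEL.out,
# and prints "LABEL <elapsed seconds>".
time_run() (
    label=$1
    shift
    start=$(date +%s)
    "$@" > "$label.out" 2>&1
    end=$(date +%s)
    echo "$label $((end - start))"
)
```

Inside the script this could wrap each invocation, for example `time_run 1gpu.serial pmemd.cuda -O -i inp/mini.in -p 1g6r.cd.parm -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1`.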
