===== Amber GPU Testing (EC) =====

We are interested in benchmarking the serial, MPI, cuda and cuda.MPI versions of pmemd.

==== Results ====

  * Verified the MPI threads and GPU invocations (one way to spot-check this is sketched right after this list)
  * Verified the output data
  * pmemd.cuda.MPI errors
  * Script used is listed at the end of this page

^ PMEMD implementation of SANDER, Release 12 ^
| Minimizing the system with 25 kcal/mol restraints on protein, 500 steps of steepest descent and 500 of conjugate gradient - Surjit Dixit problem set |

^ CPU Jobs (1,000 steps) ^ Serial ^ -np 2 ^ -np 4 ^ -np 8 ^ -np 16 ^ -np 24 ^ -np 32 ^
| Wall Time (secs) |

  * MPI speedup near -np 24 is 8x serial

^ GPU Jobs ^ Serial ^ -np 2 ^ -np 4 ^ -np 8 ^ -np 16 ^ -np 24 ^ -np 32 ^
| Wall Time (secs) |

  * GPU serial speedup is 17.5x CPU serial performance and outperforms MPI by at least 2x
  * GPU parallel speedup could not be measured (pmemd.cuda.MPI errors)

^ AMBER BENCHMARK EXAMPLES ^^^^^^
| JAC_PRODUCTION_NVE - 23,558 atoms PME ||||||
| 16 cpu cores | 1xK20 | 2xK20 |
| 12.87 | 80.50 | 88.76 | 103.09 |
| 6713.99 |
| FACTOR_IX_PRODUCTION_NVE - 90,906 atoms PME ||||||
| 16 cpu cores | 1xK20 | 2xK20 |
| 3.95 | 22.25 | 27.47 | 32.56 |
| 21865.59 |
| CELLULOSE_PRODUCTION_NVE - 408,609 atoms PME ||||||
| 16 cpu cores | 1xK20 | 2xK20 |
| 0.91 |
| 95235.87 |
| NUCLEOSOME_PRODUCTION - 25,095 atoms GB ||||||
| 16 cpu cores | 1xK20 | 2xK20 |
| 0.06 |
| 1478614.67 |

  * 5-6x performance speedups using one GPU versus 16 CPU cores (see the worked example below)
  * 9-10x performance speedups using four GPUs versus 16 CPU cores

==== Setup ====

First we get some CPU-based data.
<code>
# parallel run, note that you will need to create the machinefile
# if -np=4 it would contain 4 lines with the string '
mpirun --machinefile=nodefile -np 4 $AMBERHOME/
 -O -i mdin -o mdout -p prmtop \
</code>
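The machinefiles themselves are not shown. A minimal sketch for generating them, assuming all ranks run on this single workstation so that every line is just the local hostname:

<code>
# hypothetical helper: write nodefile2 ... nodefile32 with one hostname line per rank
for n in 2 4 8 16 24 32; do
  rm -f nodefile$n
  for i in $(seq $n); do hostname >> nodefile$n; done
done
</code>
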
==== Script ====
<code>

[TestDriveUser0@K20-WS]$ cat run
#!/bin/bash
# clean up output from any previous run
rm -rf err out logfile mdout restrt mdinfo

echo CPU serial
pmemd -O -i inp/mini.in -p 1g6r.cd.parm \
  -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
cp mdout 1core.serial.log

echo CPU parallel 2,4,8,16 /
for i in 2 4 8 16 24 32
do
  echo $i
  mpirun --machinefile=nodefile$i -np $i pmemd.MPI -O -i inp/mini.in -p 1g6r.cd.parm \
    -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
  cp mdout ${i}core.parallel.log
done

echo GPU serial
# restrict the run to specific GPU device ids
export CUDA_VISIBLE_DEVICES="
pmemd.cuda -O -i inp/mini.in -p 1g6r.cd.parm \
  -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
cp mdout 1gpu.serial.log

echo GPU parallel 2,4,8,16 /
export CUDA_VISIBLE_DEVICES="
# pmemd.cuda.MPI was only attempted with 2 ranks (it errored out, see Results)
for i in 2
do
  echo $i
  mpirun --machinefile=nodefile$i -np $i pmemd.cuda.MPI -O -i inp/mini.in -p 1g6r.cd.parm \
    -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1
  cp mdout ${i}gpu.parallel.log
done

</code>
| \\ | \\ | ||
| **[[cluster: | **[[cluster: | ||