===== Amber GPU Testing (EC) =====

We are interested in benchmarking the serial, MPI, CUDA, and CUDA MPI versions of pmemd (pmemd, pmemd.MPI, pmemd.cuda and pmemd.cuda.MPI).
+ | |||
+ | ==== Results ==== | ||
+ | |||
+ | * Verified the MPI threads and GPU invocations | ||
+ | * Verified the output data | ||
+ | * pmemd.cuda.MPI errors | ||
+ | * Script used is listed at end of this page | ||
+ | |||
+ | ^ PMEMD implementation of SANDER, Release 12 ^ | ||
+ | |Minimzing the system with 25 kcal/mol restraints on protein, 500 steps of steepest descent and 500 of conjugated gradient - Surjit Dixit problem set| | ||
+ | |||
+ | ^CPU Jobs (1,000 steps)^ Serial ^ -np 2 ^ -np 4 ^ -np 8 ^ -np 16 ^ -np 24 ^ -np 32 ^ | ||
+ | |Wall Time (secs)| | ||
+ | |||
+ | * MPI speedup near -np 24 is 8x serial | ||
+ | |||
+ | ^GPU Jobs^ Serial ^ -np 2 ^ -np 4 ^ -np 8 ^ -np 16 ^ -np 24 ^ -np 32 ^ | ||
+ | |Wall Time (secs)| | ||
+ | |||
+ | * GPU serial speedup is 17.5x CPU serial performance and outperforms MPI by at least 2x | ||
+ | * GPU parallel unable to measure | ||
+ | |||
+ | ^AMBER BENCHMARK EXAMPLES^^^^^^ | ||
+ | |JAC_PRODUCTION_NVE - 23,558 atoms PME|||||| | ||
+ | | 16 cpu cores | 1xK20 | 2xK20 | ||
+ | | 12.87 | 80.50 | 88.76 | 103.09 | ||
+ | | 6713.99 | ||
+ | |FACTOR_IX_PRODUCTION_NVE - 90,906 atoms PME|||||| | ||
+ | | 16 cpu cores | 1xK20 | 2xK20 | ||
+ | | 3.95 | 22.25 | 27.47 | 32.56 | ||
+ | | 21865.59 | ||
+ | |CELLULOSE_PRODUCTION_NVE - 408,609 atoms PME|||||| | ||
+ | | 16 cpu cores | 1xK20 | 2xK20 | ||
+ | | 0.91 | ||
+ | | 95235.87 | ||
+ | |||
+ | |||
+ | * 5-6x performance speed ups using one GPU versus 16 CPU cores | ||
+ | * 9-10x perrformance speedups using four GPUs versus 16 CPU cores | ||
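
For reference, the 5-6x figure follows directly from the single-GPU numbers in the benchmark table above (assuming larger numbers are better, i.e. ns/day style throughput):

<code>
one K20 versus 16 CPU cores
JAC_PRODUCTION_NVE:        80.50 / 12.87  ~  6.3x
FACTOR_IX_PRODUCTION_NVE:  22.25 /  3.95  ~  5.6x
</code>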
+ | |||
+ | |||
+ | ==== Setup ==== | ||
+ | |||
+ | First we get some CPU based data. | ||
+ | |||
+ | < | ||
+ | |||
+ | # serial run of pmemd | ||
+ | nohup $AMBERHOME/ | ||
+ | -c inpcrd -r restrt -x mdcrd </ | ||
+ | |||
+ | # parallel run, note that you will need create the machinefile | ||
+ | # if -np=4 it would would contain 4 lines with the string ' | ||
+ | mpirun --machinefile=nodefile -np 4 $AMBERHOME/ | ||
+ | -O -i mdin -o mdout -p prmtop \ | ||
+ | -c inpcrd -r restrt -x mdcrd </ | ||
+ | |||
+ | </ | ||
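
The machinefile (the nodefile above) simply lists one hostname per MPI rank. For -np 4 on a single host it might look like this (the hostname node2 is only an illustration):

<code>
node2$ cat nodefile
node2
node2
node2
node2
</code>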
The following script should be in your path ... located in ~/bin

You need to allocate one or more GPUs for your CUDA runs.
<code>
node2$ gpu-info
====================================================
Device  Model          Temperature   Utilization
====================================================
0       Tesla C2070    83 C           0 %
1       Tesla C2070    86 C           0 %
2       Tesla C2070    88 C          99 %
3       Tesla C2070    87 C          99 %
====================================================
</code>
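
gpu-info itself is just a small local helper script; if you need something similar on another machine, a minimal sketch built on nvidia-smi (assuming the installed driver's nvidia-smi supports the --query-gpu option) could look like this:

<code>
#!/bin/bash
# rough gpu-info equivalent: id, model, temperature and utilization per GPU
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu \
           --format=csv,noheader | \
  awk -F', ' '{ printf "%-6s %-16s %4s C %8s\n", $1, $2, $3, $4 }'
</code>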
+ | |||
+ | Next we need to expose these GPUs to pmemd ... | ||
+ | |||
+ | < | ||
+ | |||
+ | # expose one | ||
+ | export CUDA_VISIBLE_DEVICES=" | ||
+ | |||
+ | # serial run of pmemd.cuda | ||
+ | nohup $AMBERHOME/ | ||
+ | -c inpcrd -r restrt -x mdcrd </ | ||
+ | |||
+ | # parallel run, note that you will need create the machinefile | ||
+ | # if -np=4 it would could contain 4 lines with the string ' | ||
+ | mpirun --machinefile=nodefile -np 4 $AMBERHOME/ | ||
+ | -O -i mdin -o mdout -p prmtop \ | ||
+ | -c inpcrd -r restrt -x mdcrd </ | ||
+ | |||
+ | </ | ||
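
Note that the IDs passed to CUDA_VISIBLE_DEVICES are the physical device numbers reported by gpu-info; inside the job the exposed devices are renumbered starting at 0. For example (the device number is only an illustration):

<code>
# expose only physical device 2; pmemd.cuda will report it as device 0
export CUDA_VISIBLE_DEVICES="2"
</code>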
+ | |||
+ | |||
+ | You may want to try to run your pmemd problem across multiple GPUs if problem set is large enough. | ||
+ | |||
+ | < | ||
+ | |||
+ | # expose multiple (for serial or parallel runs) | ||
+ | export CUDA_VISIBLE_DEVICES=" | ||
+ | |||
+ | </ | ||
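
One way to verify that the MPI ranks actually started and that the exposed GPUs are being used (this is just a sanity check, not part of the benchmark itself):

<code>
# the number of running pmemd.cuda.MPI ranks should match -np
ps -ef | grep [p]memd.cuda.MPI | wc -l

# the exposed GPUs should show non-zero utilization while the job runs
gpu-info
</code>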
+ | |||
+ | ==== Script ==== | ||
+ | |||
+ | < | ||
+ | |||
+ | [TestDriveUser0@K20-WS]$ cat run | ||
+ | #!/bin/bash | ||
+ | rm -rf err out logfile mdout restrt mdinfo | ||
+ | |||
+ | echo CPU serial | ||
+ | pmemd -O -i inp/mini.in -p 1g6r.cd.parm \ | ||
+ | -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1 | ||
+ | cp mdout 1core.serial.log | ||
+ | |||
+ | echo CPU parallel 2,4,8,16 / | ||
+ | for i in 2 4 8 16 24 32 | ||
+ | do | ||
+ | echo $i | ||
+ | mpirun --machinefile=nodefile$i -np $i pmemd.MPI -O -i inp/mini.in -p 1g6r.cd.parm \ | ||
+ | -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1 | ||
+ | cp mdout ${i}core.parallel.log | ||
+ | done | ||
+ | |||
+ | echo GPU serial | ||
+ | export CUDA_VISIBLE_DEVICES=" | ||
+ | pmemd.cuda -O -i inp/mini.in -p 1g6r.cd.parm \ | ||
+ | -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1 | ||
+ | cp mdout 1gpu.serial.log | ||
+ | |||
+ | echo GPU parallel 2,4,8,16 / | ||
+ | export CUDA_VISIBLE_DEVICES=" | ||
+ | for i in 2 | ||
+ | do | ||
+ | echo $i | ||
+ | mpirun --machinefile=nodefile$i -np $i pmemd.cuda.MPI -O -i inp/mini.in -p 1g6r.cd.parm \ | ||
+ | -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 2>&1 | ||
+ | cp mdout ${i}gpu.parallel.log | ||
+ | done | ||
+ | |||
+ | </ | ||
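
The Wall Time numbers reported in the Results section can be pulled out of the saved mdout copies afterwards; something along these lines should work, although the exact timing labels in mdout differ a bit between the serial and MPI binaries:

<code>
# show the final timing summary lines from every saved log
grep -H -i "total.*time" *.log
</code>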
\\
**[[cluster: