cluster:175 [2018/09/22 18:28] hmeij07 [P100 vs GTX & K20]
**[[cluster:

==== P100 vs GTX & K20 ====

^ ^ P100 ^ GTX ^ K20 ^ units ^
| cores | 3,584 | 3,584 | 2,496 | count |
| mem | 12/16 | 11 | 5 | GB |
| clock | 2.6 | 1.6 | 0.7 | GHz |
| flops | 4.7 | 0.355 | 1.15 | teraflops, double precision |

Comparing these GPUs yields the results below.
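As a quick sanity check on the table, the double-precision ratios can be computed directly; a sketch using only the table values above:

```shell
# Double-precision throughput ratios from the comparison table above.
awk 'BEGIN {
  p100 = 4.7; gtx = 0.355; k20 = 1.15   # teraflops, double precision
  printf "P100/GTX: %.1fx\n", p100 / gtx
  printf "P100/K20: %.1fx\n", p100 / k20
}'
# P100/GTX: 13.2x
# P100/K20: 4.1x
```

The GTX card's strength is single precision, which is why the application results below do not simply follow these double-precision ratios.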
Credits: This work was made possible, in part, through HPC time donated by Microway, Inc. We gratefully acknowledge Microway for providing access to their GPU-accelerated compute cluster.
==== Amber ====
Amber16 continues to run best when one MPI process launches the GPU counterpart; adding more MPI ranks per GPU degrades performance.

<code>
gpu=1 mpi=1 11.94 ns/day
any mpi>gpu yielded degraded performance...

[heme@login1 amber]$ ssh node6 ~/p100-info
</code>
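The ''~/p100-info'' helper used above is not listed on this page; here is a minimal sketch of what such a script might contain, assuming only stock ''nvidia-smi'' query fields (the script name and exact field list are my guess, not confirmed by this page):

```shell
# Hypothetical reconstruction of a "p100-info" helper: dump per-GPU
# temperature, memory, and utilization as CSV via nvidia-smi.
QUERY=index,name,temperature.gpu,memory.used,memory.free,utilization.gpu,utilization.memory
if command -v nvidia-smi >/dev/null 2>&1; then
    OUT=$(nvidia-smi --query-gpu=$QUERY --format=csv)
else
    OUT="nvidia-smi not found on this host"
fi
echo "$OUT"
```

On a GPU node this prints one CSV row per device, matching the column headers shown in the outputs below.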
==== Lammps ====

We again have no complaints about GPU utilization in this example.

On our GTX server the best performance came at a cpu:gpu ratio of 16:4, reaching 932,493 tau/day (11x faster than our K20). Scaling the job down to a cpu:gpu ratio of 4:2 still yields 819,207 tau/day, which means a quad-GPU server running two such jobs can deliver about 1.6 million tau/day.

A single P100 GPU beats this easily, coming in at 2.6 million tau/day. Spreading the problem over more GPUs raised overall performance to 3.3 million tau/day; however, four concurrent cpu:gpu 1:1 jobs would achieve slightly over 10 million tau/day. That is almost 10x faster than our GTX server.
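The aggregate figures above follow from simple arithmetic on the per-job rates; a quick check using the tau/day numbers quoted in this section:

```shell
# Aggregate LAMMPS throughput implied by the per-job rates above.
GTX_HALF=819207        # tau/day at cpu:gpu 4:2 on the GTX server
P100_SINGLE=2600000    # tau/day at cpu:gpu 1:1 on the P100 node

GTX_QUAD=$((2 * GTX_HALF))        # two 4:2 jobs fill a quad-GPU server
P100_QUAD=$((4 * P100_SINGLE))    # four concurrent 1:1 jobs

echo "GTX quad server: $GTX_QUAD tau/day"    # 1638414, ~1.6 million
echo "P100 quad node:  $P100_QUAD tau/day"   # 10400000, ~10.4 million
```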
<code>

mpirun --oversubscribe -x LD_LIBRARY_PATH -np 8 \
    -H localhost,
    lmp_mpi-double-double-with-gpu -suffix gpu -pk gpu 4 \
    -in in.colloid > out.1

gpu=1 mpi=1
Performance:
gpu=2 mpi=2
Performance:
gpu=4 mpi=4
Performance:
any mpi>gpu yielded degraded performance...

index, name, temp.gpu, mem.used [MiB], mem.free [MiB], util.gpu [%], util.mem [%]
0, Tesla P100-PCIE-16GB,
1, Tesla P100-PCIE-16GB,
2, Tesla P100-PCIE-16GB,
3, Tesla P100-PCIE-16GB,

</code>
==== Gromacs ====

Gromacs has shown vastly improved performance between versions: v5 delivered about 20 ns/day per K20 server and 350 ns/day on the GTX server, while v2018 delivered 75 ns/day per K20 server and 900 ns/day on the GTX server, roughly a 3x improvement.

On the P100 test node I could not invoke the multidir option of gromacs (it has run on the GTX node, oddly). GPU utilization drops as more and more GPUs are deployed.
<code>

mpirun -np 25 --oversubscribe -x LD_LIBRARY_PATH -H \
    localhost,
    localhost,
    localhost,
    localhost,
    gmx_mpi mdrun -gpu_id 0123 -ntmpi 0 \
    -s topol.tpr -ntomp 4 -npme 1 -nsteps 20000 -pin on -nb gpu

# this does not run
#gmx_mpi mdrun -multidir 01 02 03 04 -gpu_id 0123 -ntmpi 0 -nt 0 \
#    -s topol.tpr -ntomp 4 -npme 1 -maxh 0.5 -pin on -nb gpu

gpu=4 mpi=25 (ntomp=4 -npme 1)
Performance:
gpu=3 (same)
Performance:
gpu=2 (same)
Performance:
gpu=1 (same)
Performance:

index, name, temp.gpu, mem.used [MiB], mem.free [MiB], util.gpu [%], util.mem [%]
0, Tesla P100-PCIE-16GB,

</code>
==== What to Buy ====

  * Amber folks: does not matter
  * Lammps folks: P100 nodes please
  * Gromacs folks: GTX nodes please

Remember that placing GTX GPUs in a data center voids their warranty.