P100 vs GTX & K20

            P100      GTX 1080 Ti   K20
cores       3,584     3,584         2,496
mem (GB)    12/16     11            5
clock (GHz) 2.6       1.6           0.7
DP TFLOPS   4.7/5.3   0.355         1.15

Comparing these GPUs yields the results presented below. These are not formal benchmark suites, so your mileage may vary, but they give us comparative information for decision making on our 2018 GPU Expansion Project. The GTX & K20 data comes from this page: GTX 1080 Ti

Credits: This work was made possible, in part, through HPC time donated by Microway, Inc. We gratefully acknowledge Microway for providing access to their GPU-accelerated compute cluster.

Amber

Amber16 continues to run best when a single MPI process drives each GPU. One cannot complain about the utilization rates. A dual P100 server thus delivers about 24 ns/day and a quad P100 server nearly 48 ns/day. Our quad GTX 1080 Ti server delivers 48.96 ns/day (4.5x faster than the K20). We have dual P100 nodes quoted.

mpirun -x LD_LIBRARY_PATH -np 1 -H localhost pmemd.cuda.MPI \
 -O -o mdout.0 -inf mdinfo.1K10 -x mdcrd.1K10 -r restrt.1K10.0 -ref inpcrd

gpu=1 mpi=1 11.94 ns/day
any mpi>1 and performance goes down...
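Since one MPI rank per GPU is optimal, a quad node is filled by launching four independent copies of the run above, one pinned to each device. A minimal launcher sketch — the per-job file suffixes and the use of CUDA_VISIBLE_DEVICES to isolate each job are my assumptions, not a tested recipe:

```shell
# Hypothetical sketch: fill a quad-GPU node with four independent
# single-rank pmemd.cuda.MPI runs, one per device.
for i in 0 1 2 3; do
  CUDA_VISIBLE_DEVICES=$i \
  mpirun -x LD_LIBRARY_PATH -np 1 -H localhost pmemd.cuda.MPI \
    -O -o mdout.$i -inf mdinfo.$i -x mdcrd.$i -r restrt.$i -ref inpcrd &
done
wait   # aggregate throughput roughly 4 x 11.94 ns/day
</imports></imports>
```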

[heme@login1 amber]$ ssh node6 ~/p100-info
index, name, temp.gpu, mem.used [MiB], mem.free [MiB], util.gpu [%], util.mem [%]
0, Tesla P100-PCIE-16GB, 71, 327 MiB, 15953 MiB, 100 %, 0 %
1, Tesla P100-PCIE-16GB, 49, 327 MiB, 15953 MiB, 100 %, 0 %
2, Tesla P100-PCIE-16GB, 44, 327 MiB, 15953 MiB, 100 %, 0 %
3, Tesla P100-PCIE-16GB, 43, 327 MiB, 15953 MiB, 100 %, 0 %

Note the GPU temperatures above; those are degrees Celsius under full load.
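The ~/p100-info helper is presumably a thin wrapper around the CSV query mode of nvidia-smi; a sketch that produces the same columns (the exact header abbreviations in the output above suggest some post-processing we don't reproduce here):

```shell
#!/bin/bash
# Sketch of a ~/p100-info style helper: per-GPU temperature,
# memory and utilization in CSV form (requires nvidia-smi on the node).
nvidia-smi \
  --query-gpu=index,name,temperature.gpu,memory.used,memory.free,utilization.gpu,utilization.memory \
  --format=csv
```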

Lammps

GPU utilization is also excellent in this example. On our GTX server we tend to achieve the best performance with cpu:gpu ratios around 4:1, but not on this cluster: here the best performance was obtained when the number of CPU (MPI) ranks equaled the number of GPUs used.

On our GTX server the best performance was at a 16:4 cpu:gpu ratio, 932,493 tau/day (11x faster than our K20). However, scaling the job down to a 4:2 cpu:gpu ratio yields 819,207 tau/day, which means a quad server running two such jobs can deliver about 1.6 million tau/day.

A single P100 GPU beats this easily, coming in at 2.6 million tau/day. Spreading the problem over more GPUs raised overall performance only to 3.3 million tau/day. However, four 1:1 cpu:gpu jobs would achieve slightly over 10 million tau/day. That is almost 10x faster than our GTX server.

mpirun --oversubscribe -x LD_LIBRARY_PATH -np 8 \
-H localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost \
lmp_mpi-double-double-with-gpu -suffix gpu -pk gpu 4 \
-in in.colloid > out.1 

gpu=1 mpi=1
Performance: 2684821.234 tau/day, 6214.864 timesteps/s
gpu=2 mpi=2
Performance: 3202640.823 tau/day, 7413.520 timesteps/s
gpu=4 mpi=4
Performance: 3341009.801 tau/day, 7733.819 timesteps/s
any mpi>gpu yielded degraded performance...
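The 10 million tau/day estimate assumes four independent 1:1 jobs rather than one job spread across four GPUs. A hypothetical launcher for that (CUDA_VISIBLE_DEVICES pinning and the out.gpu* names are my assumptions):

```shell
# Sketch: four independent single-rank, single-GPU LAMMPS jobs,
# each seeing only its own device (-pk gpu 1).
for i in 0 1 2 3; do
  CUDA_VISIBLE_DEVICES=$i \
  mpirun -x LD_LIBRARY_PATH -np 1 -H localhost \
    lmp_mpi-double-double-with-gpu -suffix gpu -pk gpu 1 \
    -in in.colloid > out.gpu$i &
done
wait   # roughly 4 x 2.68M tau/day if the jobs do not interfere
```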

index, name, temp.gpu, mem.used [MiB], mem.free [MiB], util.gpu [%], util.mem [%]
0, Tesla P100-PCIE-16GB, 35, 596 MiB, 15684 MiB, 82 %, 2 %
1, Tesla P100-PCIE-16GB, 38, 596 MiB, 15684 MiB, 77 %, 2 %
2, Tesla P100-PCIE-16GB, 37, 596 MiB, 15684 MiB, 81 %, 2 %
3, Tesla P100-PCIE-16GB, 37, 596 MiB, 15684 MiB, 80 %, 2 %

Lammps (PMMA)

The material simulated here is PMMA (https://en.wikipedia.org/wiki/Poly(methyl_methacrylate)), aka acrylic glass or plexiglass ("safety glass"). The PMMA simulations require the calculation of molecular bonds, which is not implemented in the GPU package, hence more CPU cores are required than in the colloid example. The optimal cpu:gpu ratio appears to be 4-6:1.

gpu      cpus   ns/day   quad   ns/day/node
1 P100    4      89       x4     356
1 GTX     6      90       x4     360
1 K20     6      47       x4     188

That means the P100 works as well as the GTX for this workload. The K20 works at 50% of the performance level of the others, which is impressive for this old GPU.
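With bonded terms staying on the CPU, each GPU wants several CPU ranks feeding it. A sketch of the 4:1 ratio on a single P100 — the input file name in.pmma is a placeholder; the binary and flags follow the colloid run above:

```shell
# Sketch: one GPU fed by four CPU MPI ranks (the 4:1 sweet spot for PMMA);
# bonded terms run on the CPU ranks, pair work is offloaded to the GPU.
mpirun -x LD_LIBRARY_PATH -np 4 -H localhost,localhost,localhost,localhost \
  lmp_mpi-double-double-with-gpu -suffix gpu -pk gpu 1 \
  -in in.pmma > out.pmma
```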

Gromacs

Gromacs has shown vastly improved performance between versions. v5 delivered about 20 ns/day per K20 server and 350 ns/day on the GTX server; v2018 delivered 75 ns/day per K20 server and 900 ns/day on the GTX server. Roughly a 3x improvement.

On the P100 test node I could not invoke the multidir option of Gromacs (it runs fine on the GTX node, which is odd). GPU utilization drops as more and more GPUs are deployed. Optimum performance was with dual GPUs, achieving 36 ns/day. Four single-GPU jobs would deliver about 120 ns/day per server, far short of the 900 ns/day of our GTX server. (We only have dual P100 nodes quoted.)

mpirun -np 25 --oversubscribe -x LD_LIBRARY_PATH -H \
localhost,localhost,localhost,localhost,localhost,localhost,\
localhost,localhost,localhost,localhost,localhost,localhost,\
localhost,localhost,localhost,localhost,localhost,localhost,\
localhost,localhost,localhost,localhost,localhost,localhost,localhost \
gmx_mpi mdrun -gpu_id 0123 -ntmpi 0 -nt 0 \
 -s topol.tpr -ntomp 4 -npme 1 -nsteps 20000 -pin on -nb gpu

# this does not run
#gmx_mpi mdrun -multidir 01 02 03 04 -gpu_id 0123 -ntmpi 0 -nt 0 \
# -s topol.tpr -ntomp 4 -npme 1 -maxh 0.5 -pin on -nb gpu

                   ns/day       hrs/ns
gpu=4 mpi=25 (ntomp=4 -npme 1)
Performance:       34.632        0.693
gpu=3 (same)
Performance:       36.218        0.663
gpu=2 (same)
Performance:       36.198        0.663
gpu=1 (same)
Performance:       30.256        0.793

index, name, temp.gpu, mem.used [MiB], mem.free [MiB], util.gpu [%], util.mem [%]
0, Tesla P100-PCIE-16GB, 36, 7048 MiB, 9232 MiB, 97 %, 4 %

Gromacs 2018.3

multidir -gpu_id 0123
-np  8 -ntomp  4 -npme 1 -maxh 0.1 -pin on -nb gpu
01/md.log:Performance:       36.692        0.654
02/md.log:Performance:       36.650        0.655
03/md.log:Performance:       36.623        0.655
04/md.log:Performance:       36.663        0.655
-np 16 -ntomp  8 -npme 1 -maxh 0.1 -pin on -nb gpu
01/md.log:Performance:       25.151        0.954
02/md.log:Performance:       25.257        0.950
03/md.log:Performance:       25.247        0.951
04/md.log:Performance:       25.345        0.947

multidir -gpu_id 00112233
-np  8 -ntomp  4 -npme 1 -maxh 0.1 -pin on -nb gpu
Error in user input:
The string of available GPU device IDs '00112233' may not contain duplicate
device IDs
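That duplicate-ID error is expected in 2018: the old mapping string with repeated device IDs moved to the -gputasks flag, while -gpu_id now only lists the available (unique) devices. Untested here, but the 2018-style invocation should look something like this — note the length of the -gputasks string must match the number of GPU tasks mdrun actually creates:

```shell
# GROMACS 2018 split the old mapping: -gpu_id lists available devices,
# -gputasks maps each GPU task to a device (duplicates allowed here).
mpirun -np 8 --oversubscribe -x LD_LIBRARY_PATH \
  gmx_mpi mdrun -multidir 01 02 03 04 -gpu_id 0123 -gputasks 00112233 \
  -ntomp 4 -npme 1 -maxh 0.1 -pin on -nb gpu
```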

What to Buy

  • Amber folks: does not matter
  • Lammps folks: P100 nodes please
  • Gromacs folks: GTX nodes please

Remember that placing GTX GPUs in a data center voids their warranty.



cluster/175.1537992086.txt.gz · Last modified: 2018/09/26 16:01 by hmeij07