2, GeForce GTX 1080 Ti, 66, 352 MiB, 10820 MiB, 56 %, 1 %
3, GeForce GTX 1080 Ti, 63, 352 MiB, 10820 MiB, 57 %, 1 %
  
# note, from nvidia-smi --help-query-gpu
"temperature.gpu"
 Core GPU temperature. in degrees C.

From vendor: "80-83C is good actually. In some warmer environments you would be seeing 85-87C which is still just fine for 24/7 operation anyway."

</code>
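
The per-GPU table above looks like CSV output from nvidia-smi --query-gpu. The exact invocation is not shown on this page, so treat the field list below as an assumption; the field names come from nvidia-smi --help-query-gpu:

<code>
# assumed query; columns: index, name, temp, memory used, memory free, gpu util, memory util
nvidia-smi --query-gpu=index,name,temperature.gpu,memory.used,memory.free,utilization.gpu,utilization.memory \
           --format=csv,noheader
</code>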
  
==== Bench ====
  
  * Amber 16. The nucleosome bench runs 4.5x faster than on a K20
    * Not sure it is representative of our workload
    * Adding more MPI threads decreases performance
    * Running across more GPUs (2 or 4) decreases performance
    * One Amber process with a single MPI thread per GPU is optimal

**Wow, I just realized the most important metric: our K20s have a job throughput of 20 per unit of time. The amber128 queue will have a throughput of 4*4.5, or 18, per the same unit of time. One new server matches five old ones (well, purchased in 2013), from an Amber-only perspective.**
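
A minimal sketch (not the page's actual script; the run directories are made up) of the layout that throughput estimate assumes, namely four independent single-rank pmemd.cuda runs, one pinned to each GPU:

<code>
# 4 GPUs x ~4.5x K20 speed each = ~18 K20-equivalent jobs per unit of time
for g in 0 1 2 3; do
  ( export CUDA_VISIBLE_DEVICES=$g          # pin this run to one device
    cd run$g && $AMBERHOME/bin/pmemd.cuda -O -i mdin.GPU \
      -p prmtop -c inpcrd -ref inpcrd -o mdout ) &
done
wait
</code>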

<code>

nvidia-smi -pm 0; nvidia-smi -c 0
# gpu_id is done via CUDA_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=$STRING_2
# on n78
/usr/local/mpich-3.1.4/bin/mpirun -launcher ssh -f /home/hmeij/amber/nucleosome/hostfile \
-n $STRING_1 $AMBERHOME/bin/pmemd.cuda.MPI -O -o /tmp/mdout -i mdin.GPU \
-p prmtop -c inpcrd -ref inpcrd ; grep 'ns/day' /tmp/mdout
# on n34
/cm/shared/apps/mvapich2/gcc/64/1.6/bin/mpirun_rsh -ssh -hostfile /home/hmeij/amber/nucleosome/hostfile2 \
-np $STRING_1  pmemd.cuda.MPI -O -o /tmp/mdout -i mdin.GPU -p prmtop -c inpcrd -ref inpcrd; grep 'ns/day' /tmp/mdout


Nucleosome metric: ns/day, seconds/ns across all steps, x nr of GPUs


GTX on n78

-n 1, -gpu_id 0
|         ns/day =      12.24   seconds/ns =    7058.94   x4 = 48.96  (4.5x faster than K20)
-n 2, -gpu_id 0
|         ns/day =      11.50   seconds/ns =    7509.97
-n 4, -gpu_id 0
|         ns/day =      10.54   seconds/ns =    8197.80
-n 4, -gpu_id 01
|         ns/day =      20.70   seconds/ns =    4173.55   x2 = 41.40
-n 8, -gpu_id 01
|         ns/day =      17.44   seconds/ns =    4953.04
-n 4, -gpu_id 0123
|         ns/day =      32.90   seconds/ns =    2626.27   x1
-n 8, -gpu_id 0123
|         ns/day =      28.43   seconds/ns =    3038.72   x1


K20 on n34

-n 1, -gpu_id 0
|             ns/day =       2.71   seconds/ns =   31883.03
-n 4, -gpu_id 0
|             ns/day =       1.53   seconds/ns =   56325.00
-n 4, -gpu_id 0123
|             ns/day =       5.87   seconds/ns =   14730.45


</code>
  
  * Gromacs 5.1.4. My (Colin's) multidir bench runs about 2x faster than on a K20
Mapping of GPU IDs to the 16 PP ranks in this node: 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
Performance:       19.814        1.211 (x16 = 317.024)

# UPDATE Gromacs 2018, check out these new performance stats for -n 4, -gpu=4

# K20, redone with cuda 9

[root@cottontail gpu]# egrep 'ns/day|Performance' 0[0-4]/md.log
01/md.log:                 (ns/day)    (hour/ns)
01/md.log:Performance:       74.275        0.323
02/md.log:                 (ns/day)    (hour/ns)
02/md.log:Performance:       74.111        0.324
03/md.log:                 (ns/day)    (hour/ns)
03/md.log:Performance:       73.965        0.324
04/md.log:                 (ns/day)    (hour/ns)
04/md.log:Performance:       74.207        0.323

# GTX1080 cuda 8

[hmeij@cottontail gpu]$ egrep 'ns/day|Performance' 0[1-4]/md.log
01/md.log:                 (ns/day)    (hour/ns)
01/md.log:Performance:      229.229        0.105
02/md.log:                 (ns/day)    (hour/ns)
02/md.log:Performance:      221.936        0.108
03/md.log:                 (ns/day)    (hour/ns)
03/md.log:Performance:      217.618        0.110
04/md.log:                 (ns/day)    (hour/ns)
04/md.log:Performance:      228.854        0.105

Almost 900 ns/day for a single server.
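
# (added sketch, not part of the original runs) summing the four per-replica
# rates shows where the "almost 900 ns/day" figure comes from:
egrep '^Performance' 0[1-4]/md.log | awk '{sum += $2} END {print sum " ns/day total"}'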
  
</code>
  
  * Lammps 11Aug17 runs about 11x faster than on a K20
    * used the colloid example, not sure if that's a good example
    * like Gromacs, lots of room for improvements
    * used the double-double binary, surprised at the speed
      * the single-double binary might run faster? (see the sketch after the results below)

<code>

nvidia-smi -pm 0; nvidia-smi -c 0
# gpu_id is done via CUDA_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=$STRING_2
# on n78
/usr/local/mpich-3.1.4/bin/mpirun -launcher ssh -f ./hostfile -n $STRING_1 \
/usr/local/lammps-11Aug17/lmp_mpi-double-double-with-gpu -suffix gpu \
$STRING_3 -in in.colloid > /tmp/out ; grep tau /tmp/out
# on n34
/cm/shared/apps/mvapich2/gcc/64/1.6/bin/mpirun_rsh -ssh \
-hostfile /home/hmeij/sharptail/hostfile2 -np $STRING_1 \
/share/apps/CENTOS6/lammps/31Mar17/lmp_gpu_double \
-suffix gpu $STRING_3  -in in.colloid > /tmp/out ; grep tau /tmp/out



Created 5625 atoms

-n 1, -gpu_id 0
Performance: 581,359 tau/day, 1,345 timesteps/s
-n 2, -gpu_id 01
Performance: 621,822 tau/day, 1,439 timesteps/s
-n 4, -gpu_id 0123
Performance: 479,795 tau/day, 1,110 timesteps/s

-n 4, -gpu_id 01, -pk gpu 2
Performance: 819,207 tau/day, 1,896 timesteps/s
-n 8, -gpu_id 01, -pk gpu 2
Performance: 519,173 tau/day, 1,201 timesteps/s
-n 6, -gpu_id 0123, -pk gpu 4
Performance: 881,981 tau/day, 2,041 timesteps/s
-n 8, -gpu_id 0123, -pk gpu 4
Performance: 932,493 tau/day, 2,158 timesteps/s (11x K20)
-n 16, -gpu_id 0123, -pk gpu 4
Performance: 582,717 tau/day, 1,348 timesteps/s


K20 on n34

-n 8, -gpu_id 0123, -pk gpu 4
Performance: 84,985 tau/day, 196 timesteps/s


GTX on n78 again
-n 8, -gpu_id 0123, -pk gpu 4

Created 22500 atoms
Performance: 552,986 tau/day, 1,280 timesteps/s
Created 90000 atoms
Performance: 210,864 tau/day, 488 timesteps/s


</code>
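
The single-double question from the list above could be checked by rerunning the same colloid input with the mixed-precision build. A minimal sketch, assuming the binary naming used further down this page (lmp_mpi-single-double-with-gpu):

<code>
# same launch as above on n78, only the binary changes
/usr/local/mpich-3.1.4/bin/mpirun -launcher ssh -f ./hostfile -n $STRING_1 \
/usr/local/lammps-11Aug17/lmp_mpi-single-double-with-gpu -suffix gpu \
$STRING_3 -in in.colloid > /tmp/out ; grep tau /tmp/out
</code>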
  
==== Scripts ====
</code>
  
==== PMMA Bench ====

  * Runs fastest when constrained to one GPU with 4 MPI threads (see the sketch after this list)
  * Room for improvement as GPU and GPU memory are not fully utilized
  * Adding MPI threads or more GPUs reduces ns/day performance
  * No idea if adding OMP threads shows a different picture
  * No idea how it compares to K20 GPUs
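
A minimal sketch (not from this page) of what "one GPU, 4 MPI threads" looks like when the whole node is filled: four concurrent 4-rank runs, one per GPU, which is where the "x4 = 194 ns/day/node" figure below comes from.

<code>
for g in 0 1 2 3; do
  ( export CUDA_VISIBLE_DEVICES=$g       # each run sees only its own GPU
    /usr/local/mpich-3.1.4/bin/mpirun -n 4 \
    /usr/local/lammps-11Aug17/lmp_mpi-double-double-with-gpu -suffix gpu -pk gpu 1 \
    -in nvt.in -var t 310 > /dev/null 2>&1 ) &
done
wait
</code>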

<code>

nvidia-smi -pm 0; nvidia-smi -c 0
# gpu_id is done via CUDA_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=[0,1,2,3]

# on n78
cd /home/hmeij/lammps/benchmark
rm -f /tmp/lmp-run.log;rm -f *.jpg;\
time /usr/local/mpich-3.1.4/bin/mpirun -launcher ssh -f ./hostfile  -n $STRING_1 \
/usr/local/lammps-11Aug17/lmp_mpi-double-double-with-gpu -suffix gpu -pk gpu $STRING_2 \
-in nvt.in -var t 310 > /dev/null 2>&1; grep ^Performance /tmp/lmp-run.log


PMMA Benchmark Performance Metric: ns/day (x nr of GPUs for node output)


Lammps 11Aug17 on GTX1080Ti (n78)

-n 1, -gpu_id 3
Performance: 19.974 ns/day, 1.202 hours/ns, 231.176 timesteps/s
3, GeForce GTX 1080 Ti, 38, 219 MiB, 10953 MiB, 30 %, 1 %
-n 2, -gpu_id 3
Performance: 33.806 ns/day, 0.710 hours/ns, 391.277 timesteps/s
3, GeForce GTX 1080 Ti, 57, 358 MiB, 10814 MiB, 47 %, 3 %
-n 4, -gpu_id 3
Performance: 48.504 ns/day, 0.495 hours/ns, 561.388 timesteps/s (x4 = 194 ns/day/node)
3, GeForce GTX 1080 Ti, 59, 690 MiB, 10482 MiB, 76 %, 4 %
-n 8, -gpu_id 3
Performance: 37.742 ns/day, 0.636 hours/ns, 436.833 timesteps/s
3, GeForce GTX 1080 Ti, 47, 1332 MiB, 9840 MiB, 90 %, 4 %
-n 4, -gpu_id 01
Performance: 57.621 ns/day, 0.417 hours/ns, 666.912 timesteps/s
0, GeForce GTX 1080 Ti, 48, 350 MiB, 10822 MiB, 50 %, 3 %
1, GeForce GTX 1080 Ti, 37, 344 MiB, 10828 MiB, 49 %, 3 %
-n 8, -gpu_id 01
Performance: 63.625 ns/day, 0.377 hours/ns, 736.400 timesteps/s (x2 = 127 ns/day/node)
0, GeForce GTX 1080 Ti, 66, 670 MiB, 10502 MiB, 77 %, 4 %
1, GeForce GTX 1080 Ti, 51, 670 MiB, 10502 MiB, 81 %, 4 %
-n 12, -gpu_id 01
Performance: 61.198 ns/day, 0.392 hours/ns, 708.315 timesteps/s
0, GeForce GTX 1080 Ti, 65, 988 MiB, 10184 MiB, 82 %, 4 %
1, GeForce GTX 1080 Ti, 50, 990 MiB, 10182 MiB, 85 %, 4 %
-n 8, -gpu_id 0123
Performance: 86.273 ns/day, 0.278 hours/ns, 998.534 timesteps/s
0, GeForce GTX 1080 Ti, 56, 340 MiB, 10832 MiB, 57 %, 3 %
1, GeForce GTX 1080 Ti, 41, 340 MiB, 10832 MiB, 52 %, 2 %
2, GeForce GTX 1080 Ti, 43, 340 MiB, 10832 MiB, 57 %, 3 %
3, GeForce GTX 1080 Ti, 42, 340 MiB, 10832 MiB, 55 %, 2 %
-n 12, -gpu_id 0123
Performance: 108.905 ns/day, 0.220 hours/ns, 1260.478 timesteps/s (x1 = 109 ns/day/node)
-n 16
Performance: 88.989 ns/day, 0.270 hours/ns, 1029.964 timesteps/s



# on n34
unable to get it to run...

K20 on n34

-n 1, -gpu_id 0
-n 4, -gpu_id 0
-n 4, -gpu_id 0123

# comparison of binaries running PMMA
# 1 gpu 4 mpi threads each run

# lmp_mpi-double-double-with-gpu.log
Performance: 49.833 ns/day, 0.482 hours/ns, 576.769 timesteps/s
# lmp_mpi-single-double-with-gpu.log
Performance: 58.484 ns/day, 0.410 hours/ns, 676.899 timesteps/s
# lmp_mpi-single-single-with-gpu.log
Performance: 56.660 ns/day, 0.424 hours/ns, 655.793 timesteps/s

</code>

==== FSL ====

**User Time Reported** from the time command

  * mwgpu cpu run
  * 2013 model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
    * All tests 45m
    * Bft test 16m28s (bedpostx)

  * amber128 cpu run
  * 2017 model name      : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
    * All tests 17m - 2.5x faster
    * Bft test 3m39s - 6x faster (bedpostx)

  * amber128 gpu run
  * 2017 CUDA Device Name: GeForce GTX 1080 Ti
    * Bft gpu test 0m1.881s (what!? from command line) - 116x faster (bedpostx_gpu)
    * Bft gpu test 0m1.850s (what!? via scheduler) - 118x faster (bedpostx_gpu)
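
A minimal sketch of how such a timing is collected (the subject directory name is an assumption; both tools take the subject data directory as their argument):

<code>
time bedpostx subjectdir         # CPU version
time bedpostx_gpu subjectdir     # CUDA version (GTX 1080 Ti)
</code>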


==== FreeSurfer ====


  * http://freesurfer.net/fswiki/DownloadAndInstall#TestyourFreeSurferInstallation
  * Example using sample-001.mgz
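
A hedged sketch of the kind of command these timings refer to (the exact example steps are on the linked page; shown are the standard recon-all import and full-run flags):

<code>
recon-all -i sample-001.mgz -s bert -all
</code>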

<code>

Node n37 (mwgpu cpu run)
(2013) Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
recon-all -s bert finished without error
example 1 user    0m3.516s
example 2 user    893m1.761s ~15 hours
example 3 user    ???m       ~15 hours (estimated)

Node n78 (amber128 cpu run)
(2017) Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
recon-all -s bert finished without error
example 1 user    0m2.315s
example 2 user    488m49.215s ~8 hours
example 3 user    478m44.622s ~8 hours


freeview -v \
    bert/mri/T1.mgz \
    bert/mri/wm.mgz \
    bert/mri/brainmask.mgz \
    bert/mri/aseg.mgz:colormap=lut:opacity=0.2 \
    -f \
    bert/surf/lh.white:edgecolor=blue \
    bert/surf/lh.pial:edgecolor=red \
    bert/surf/rh.white:edgecolor=blue \
    bert/surf/rh.pial:edgecolor=red


</code>

Development code for the GPU: http://surfer.nmr.mgh.harvard.edu/fswiki/freesurfer_linux_developers_page

\\
**[[cluster:0|Back]]**