Differences

This shows you the differences between two versions of the page.

cluster:164 [2017/10/24 08:57]
hmeij07
cluster:164 [2018/09/21 07:59] (current)
hmeij07
Line 81: Line 81:
 2, GeForce GTX 1080 Ti, 66, 352 MiB, 10820 MiB, 56 %, 1 %
 3, GeForce GTX 1080 Ti, 63, 352 MiB, 10820 MiB, 57 %, 1 %
- 
  
 # note, from nvidia-smi --help-query-gpu
 "temperature.gpu"
  Core GPU temperature. in degrees C.
 +
 +From the vendor: "80-83C is good actually. In some warmer environments you would be seeing 85-87C which is still just fine for 24/7 operation anyway."
 +
 </code>
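 +
 +The CSV lines above can be reproduced with nvidia-smi's query interface. A minimal sketch; the field list is inferred from the columns shown, not taken from the original monitoring script:
 +
 +<code>
 +# sample every 5 seconds; columns: index, name, temp, mem used, mem free, gpu util, mem util
 +nvidia-smi --query-gpu=index,name,temperature.gpu,memory.used,memory.free,utilization.gpu,utilization.memory \
 +  --format=csv,noheader --loop=5
 +</code>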
  
Line 92: Line 94:
 ==== Bench ====
  
-  * Amber 16. My sample script runs 3-4x faster than on a K20
-    * Do not have enough expertise to assess this, need stats from Kelly
+  * Amber 16. Nucleosome bench runs 4.5x faster than on a K20
+    * Not sure it is representative of our workload
 +    * Adding more MPI threads decreases performance 
 +    * Running across more gpus (2 or 4) decreases performance 
 +    * One Amber process per MPI thread per GPU is optimal (a single-GPU launch sketch follows the benchmark output below)
 + 
 +**Wow, I just realized the most important metric: our K20s have a job throughput of 20 per unit of time. The amber128 queue will have a throughput of 4*4.5, or 18, per the same unit of time. One new server matches five old ones (well, purchased in 2013), from an amber-only perspective.**
 + 
 +<code> 
 + 
 +nvidia-smi -pm 0; nvidia-smi -c 0 
 +# gpu_id is done via CUDA_VISIBLE_DEVICES 
 +export CUDA_VISIBLE_DEVICES=$STRING_2 
 +# on n78 
 +/usr/local/mpich-3.1.4/bin/mpirun -launcher ssh -f /home/hmeij/amber/nucleosome/hostfile \ 
 +-n $STRING_1 $AMBERHOME/bin/pmemd.cuda.MPI -O -o /tmp/mdout -i mdin.GPU \ 
 +-p prmtop -c inpcrd -ref inpcrd ; grep 'ns/day' /tmp/mdout 
 +# on n34 
 +/cm/shared/apps/mvapich2/gcc/64/1.6/bin/mpirun_rsh -ssh -hostfile /home/hmeij/amber/nucleosome/hostfile2 \ 
 +-np $STRING_1  pmemd.cuda.MPI -O -o /tmp/mdout -i mdin.GPU -p prmtop -c inpcrd -ref inpcrd; grep 'ns/day' /tmp/mdout 
 + 
 + 
 +Nucleosome metric: ns/day, seconds/ns across all steps; x nr of gpus = per-node throughput
 + 
 + 
 +GTX on n78 
 + 
 +-n 1, -gpu_id 0 
 +|         ns/day =      12.24   seconds/ns =    7058.94   x4 = 48.96  (4.5x faster than K20)
 +-n 2, -gpu_id 0 
 +|         ns/day =      11.50   seconds/ns =    7509.97 
 +-n 4, -gpu_id 0 
 +|         ns/day =      10.54   seconds/ns =    8197.80 
 +-n 4, -gpu_id 01 
 +|         ns/day =      20.70   seconds/ns =    4173.55   x2 = 41.40 
 +-n 8, -gpu_id 01 
 +|         ns/day =      17.44   seconds/ns =    4953.04 
 +-n 4, -gpu_id 0123 
 +|         ns/day =      32.90   seconds/ns =    2626.27   x1 
 +-n 8, -gpu_id 0123 
 +|         ns/day =      28.43   seconds/ns =    3038.72   x1 
 + 
 + 
 +K20 on n34  
 + 
 +-n 1, -gpu_id 0 
 +|             ns/day =       2.71   seconds/ns =   31883.03 
 +-n 4, -gpu_id 0 
 +|             ns/day =       1.53   seconds/ns =   56325.00 
 +-n 4, -gpu_id 0123
 +|             ns/day =       5.87   seconds/ns =   14730.45 
 + 
 + 
 + 
 +</code>
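 +
 +Since one Amber process per GPU turns out to be optimal, the throughput claim above amounts to running four independent single-GPU jobs per node. A minimal sketch, reusing the nucleosome inputs from the benchmark commands above; the per-GPU output directories are made up for this sketch and pmemd.cuda (the serial GPU binary) is assumed rather than pmemd.cuda.MPI:
 +
 +<code>
 +# four concurrent single-GPU jobs on n78, one pmemd.cuda process per GPU
 +for i in 0 1 2 3; do
 +  mkdir -p /tmp/gpu$i
 +  ( export CUDA_VISIBLE_DEVICES=$i
 +    $AMBERHOME/bin/pmemd.cuda -O -o /tmp/gpu$i/mdout -i mdin.GPU \
 +      -p prmtop -c inpcrd -ref inpcrd ) &
 +done
 +wait
 +grep 'ns/day' /tmp/gpu*/mdout
 +</code>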
  
  * Gromacs 5.1.4 My (Colin's) multidir bench runs about 2x faster than on a K20
Line 99: Line 154:
     * 4 multidirs on 4 gpus achieves sweet spot at roughly 350 ns/day (a filled-in command sketch follows the output below)
  
-  * Lammps
+<code>
 + 
 +# about 20 mins per run 
 +/usr/local/mpich-3.1.4/bin/mpirun -launcher ssh -f ./hostfile $STRING_1 \ 
 +gmx_mpi mdrun -nsteps 600000 $STRING_2 -gpu_id $STRING_3 \ 
 +-ntmpi 0 -npme 0 -s topol.tpr -ntomp 0 -pin on -nb gpu   
 + 
 +# Gromacs seems to have a mind of its own
 +On host n78 4 GPUs user-selected for this run. 
 +Mapping of GPU IDs to the 4 PP ranks in this node: 0,1,2,3 (-n<=4) 
 + 
 +Metric:          (ns/day)    (hour/ns) (x? = ??? ns/day) 
 + 
 +-n 1, -multidir 01, -gpu_id 0 
 +Using 4 MPI processes 
 +Using 8 OpenMP threads per MPI process 
 +Performance:      123.679        0.194 (x1) 
 + 
 +-n2, -multidir 01 02, -gpu_id 01 
 +Using 2 MPI processes 
 +Using 8 OpenMP threads per MPI process 
 +Performance:       95.920        0.250 (x2 = 191.84) 
 + 
 +-n 4, -multidir 01 02 03 04, -gpu_id 0123 
 +Using 1 MPI process 
 +Using 8 OpenMP threads  
 +Performance:       87.220        0.275 (x4 = 348.88) 
 + 
 +-n 8, -multidir 01 02 03 04 05 06 07 08, -gpu_id 00112233
 +Using 1 MPI process 
 +Using 4 OpenMP threads                             
 +cudaMallocHost of size 1024128 bytes failed: all CUDA-capable devices are busy or unavailable 
 +Ahh, nvidia compute modes need to be -pm 0 & -c 0 for gromacs ... 
 +NOTE: The GPU has >25% less load than the CPU. This imbalance causes performance loss. 
 +Performance:       45.070        0.533 (x8 = 360.56) 
 + 
 +-n 16 (max physical cpu cores), -multidir 01 02 ... 15 16, -gpu_id 0000111122223333 
 +Using 1 MPI process 
 +Using 2 OpenMP threads  
 +Mapping of GPU IDs to the 16 PP ranks in this node: 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
 +Performance:       19.814        1.211 (x16 = 317.024) 
 + 
 +# UPDATE Gromacs 2018, check out these new performance stats for -n 4, -gpu=4 
 + 
 +# K20, redone with cuda 9 
 + 
 +[root@cottontail gpu]# egrep 'ns/day|Performance' 0[0-4]/md.log
 +01/md.log:                 (ns/day)    (hour/ns) 
 +01/md.log:Performance:       74.275        0.323 
 +02/md.log:                 (ns/day)    (hour/ns) 
 +02/md.log:Performance:       74.111        0.324 
 +03/md.log:                 (ns/day)    (hour/ns) 
 +03/md.log:Performance:       73.965        0.324 
 +04/md.log:                 (ns/day)    (hour/ns) 
 +04/md.log:Performance:       74.207        0.323 
 + 
 +# GTX1080 cuda 8 
 +  
 +[hmeij@cottontail gpu]$ egrep 'ns/day|Performance' 0[1-4]/md.log 
 +01/md.log:                 (ns/day)    (hour/ns) 
 +01/md.log:Performance:      229.229        0.105 
 +02/md.log:                 (ns/day)    (hour/ns) 
 +02/md.log:Performance:      221.936        0.108 
 +03/md.log:                 (ns/day)    (hour/ns) 
 +03/md.log:Performance:      217.618        0.110 
 +04/md.log:                 (ns/day)    (hour/ns) 
 +04/md.log:Performance:      228.854        0.105 
 + 
 +Almost 900 ns/day for a single server. 
 + 
 +</code> 
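 +
 +For reference, the sweet-spot run from the bullets above with the $STRING placeholders filled in. A sketch only; the values come from the "-n 4, -multidir 01 02 03 04, -gpu_id 0123" result above rather than from a saved job script:
 +
 +<code>
 +# 4 MPI ranks, one per multidir, one GPU each (roughly 350 ns/day per node)
 +/usr/local/mpich-3.1.4/bin/mpirun -launcher ssh -f ./hostfile -n 4 \
 +gmx_mpi mdrun -nsteps 600000 -multidir 01 02 03 04 -gpu_id 0123 \
 +-ntmpi 0 -npme 0 -s topol.tpr -ntomp 0 -pin on -nb gpu
 +</code>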
 + 
 +  * Lammps 11Aug17 runs about 11x faster than K20 
 +    * used the colloid example, not sure if that's a good example 
 +    * like Gromacs, lots of room for improvement
 +    * used the double-double binary, surprised at the speed
 +      * single-double binary might run faster? (a sketch follows the output below)
 + 
 +<code> 
 + 
 +nvidia-smi -pm 0; nvidia-smi -c 0 
 +# gpu_id is done via CUDA_VISIBLE_DEVICES 
 +export CUDA_VISIBLE_DEVICES=$STRING_2
 +# on n78 
 +/usr/local/mpich-3.1.4/bin/mpirun -launcher ssh -f ./hostfile -n $STRING_1 \ 
 +/usr/local/lammps-11Aug17/lmp_mpi-double-double-with-gpu -suffix gpu \ 
 +$STRING_3 -in in.colloid > /tmp/out ; grep tau /tmp/out 
 +# on n34 
 +/cm/shared/apps/mvapich2/gcc/64/1.6/bin/mpirun_rsh -ssh \ 
 +-hostfile /home/hmeij/sharptail/hostfile2 -np $STRING_1 \ 
 +/share/apps/CENTOS6/lammps/31Mar17/lmp_gpu_double \ 
 +-suffix gpu $STRING_3  -in in.colloid > /tmp/out ; grep tau /tmp/out 
 + 
 + 
 + 
 +Created 5625 atoms 
 + 
 +-n 1, -gpu_id 0 
 +Performance: 581,359 tau/day, 1,345 timesteps/s  
 +-n 2, -gpu_id 01 
 +Performance: 621,822 tau/day, 1,439 timesteps/s  
 +-n 4, -gpu_id 0123 
 +Performance: 479,795 tau/day, 1,110 timesteps/s  
 + 
 +-n 4, -gpu_id 01, -pk gpu 2 
 +Performance: 819,207 tau/day, 1,896 timesteps/s  
 +-n 8, -gpu_id 01, -pk gpu 2 
 +Performance: 519,173 tau/day, 1,201 timesteps/s  
 +-n 6, -gpu_id 0123, -pk gpu 4 
 +Performance: 881,981 tau/day, 2,041 timesteps/s
 +-n 8, -gpu_id 0123, -pk gpu 4 
 +Performance: 932,493 tau/day, 2,158 timesteps/s (11x K20) 
 +-n 16, -gpu_id 0123, -pk gpu 4 
 +Performance: 582,717 tau/day, 1,348 timesteps/s
 + 
 + 
 +K20 on n34  
 + 
 +-n 8, -gpu_id 0123, -pk gpu 4
 +Performance: 84985 tau/day, 196 timesteps/s  
 + 
 + 
 +GTX on n78 again  
 +-n 8, -gpu_id 0123, -pk gpu 4 
 + 
 +Created 22500 atoms 
 +Performance: 552,986 tau/day, 1,280 timesteps/s
 +Created 90000 atoms 
 +Performance: 210,864 tau/day, 488 timesteps/s
 + 
 + 
 +</code>
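 +
 +To answer the open question above about the single-double build, the same colloid run could be repeated with that binary. A sketch only; the binary name is taken from the PMMA comparison further down, and the launch mirrors the best run above:
 +
 +<code>
 +# same launch as the 11x run above, but with the single-double binary
 +export CUDA_VISIBLE_DEVICES=0,1,2,3
 +/usr/local/mpich-3.1.4/bin/mpirun -launcher ssh -f ./hostfile -n 8 \
 +/usr/local/lammps-11Aug17/lmp_mpi-single-double-with-gpu -suffix gpu \
 +-pk gpu 4 -in in.colloid > /tmp/out ; grep tau /tmp/out
 +</code>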
  
 ==== Scripts ====
Line 405: Line 591:
 </code>
  
 +==== PMMA Bench ====
 +
 +  * Runs fastest when constrained to one gpu with 4 mpi threads
 +  * Room for improvement as gpu and gpu memory are not fully utilized
 +  * Adding mpi threads or more gpus reduces ns/day performance
 +  * No idea if adding omp threads shows a different picture (a speculative sketch follows the output below)
 +  * No idea how it compares to K20 gpus
 +
 +<code>
 +
 +nvidia-smi -pm 0; nvidia-smi -c 0
 +# gpu_id is done via CUDA_VISIBLE_DEVICES
 +export CUDA_VISIBLE_DEVICES=[0,1,2,3]
 +
 +# on n78
 +cd /home/hmeij/lammps/benchmark
 +rm -f /tmp/lmp-run.log;rm -f *.jpg;\
 +time /usr/local/mpich-3.1.4/bin/mpirun -launcher ssh -f ./hostfile  -n $STRING_1 \
 +/usr/local/lammps-11Aug17/lmp_mpi-double-double-with-gpu -suffix gpu -pk gpu $STRING_2 \
 +-in nvt.in -var t 310 > /dev/null 2>&1; grep ^Performance /tmp/lmp-run.log
 +
 +
 +PMMA Benchmark Performance Metric ns/day (x  nr of gpus for node output)
 +
 +
 +Lammps 11Aug17 on GTX1080Ti (n78)
 +
 +-n 1, -gpu_id 3
 +Performance: 19.974 ns/day, 1.202 hours/ns, 231.176 timesteps/s
 +3, GeForce GTX 1080 Ti, 38, 219 MiB, 10953 MiB, 30 %, 1 %                                                      
 +-n 2, -gpu_id 3
 +Performance: 33.806 ns/day, 0.710 hours/ns, 391.277 timesteps/s
 +3, GeForce GTX 1080 Ti, 57, 358 MiB, 10814 MiB, 47 %, 3 %
 +-n 4, -gpu_id 3
 +Performance: 48.504 ns/day, 0.495 hours/ns, 561.388 timesteps/s (x4 = 194 ns/day/node)
 +3, GeForce GTX 1080 Ti, 59, 690 MiB, 10482 MiB, 76 %, 4 %
 +-n 8, -gpu_id 3
 +Performance: 37.742 ns/day, 0.636 hours/ns, 436.833 timesteps/s
 +3, GeForce GTX 1080 Ti, 47, 1332 MiB, 9840 MiB, 90 %, 4 %
 +-n 4, -gpu_id 01
 +Performance: 57.621 ns/day, 0.417 hours/ns, 666.912 timesteps/s
 +0, GeForce GTX 1080 Ti, 48, 350 MiB, 10822 MiB, 50 %, 3 %
 +1, GeForce GTX 1080 Ti, 37, 344 MiB, 10828 MiB, 49 %, 3 %
 +-n 8, -gpu_id 01
 +Performance: 63.625 ns/day, 0.377 hours/ns, 736.400 timesteps/s (x2 = 127 ns/day/node)
 +0, GeForce GTX 1080 Ti, 66, 670 MiB, 10502 MiB, 77 %, 4 %
 +1, GeForce GTX 1080 Ti, 51, 670 MiB, 10502 MiB, 81 %, 4 %
 +-n 12, -gpu_id 01
 +Performance: 61.198 ns/day, 0.392 hours/ns, 708.315 timesteps/s
 +0, GeForce GTX 1080 Ti, 65, 988 MiB, 10184 MiB, 82 %, 4 %
 +1, GeForce GTX 1080 Ti, 50, 990 MiB, 10182 MiB, 85 %, 4 %
 +-n 8, -gpu_id 0123
 +Performance: 86.273 ns/day, 0.278 hours/ns, 998.534 timesteps/s
 +0, GeForce GTX 1080 Ti, 56, 340 MiB, 10832 MiB, 57 %, 3 %
 +1, GeForce GTX 1080 Ti, 41, 340 MiB, 10832 MiB, 52 %, 2 %
 +2, GeForce GTX 1080 Ti, 43, 340 MiB, 10832 MiB, 57 %, 3 %
 +3, GeForce GTX 1080 Ti, 42, 340 MiB, 10832 MiB, 55 %, 2 %
 +-n 12, -gpu_id 0123
 +Performance: 108.905 ns/day, 0.220 hours/ns, 1260.478 timesteps/s (x1 = 109 ns/day/node)
 +-n 16
 +Performance: 88.989 ns/day, 0.270 hours/ns, 1029.964 timesteps/s
 +
 +
 +
 +# on n34
 +unable to get it to run...
 +
 +K20 on n34 
 +
 +-n 1, -gpu_id 0
 +-n 4, -gpu_id 0
 +-n 4, -gpu_id 0123
 +
 +# comparison of binaries running PMMA
 +# 1 gpu 4 mpi threads each run
 +
 +# lmp_mpi-double-double-with-gpu.log
 +Performance: 49.833 ns/day, 0.482 hours/ns, 576.769 timesteps/s
 +# lmp_mpi-single-double-with-gpu.log
 +Performance: 58.484 ns/day, 0.410 hours/ns, 676.899 timesteps/s
 +# lmp_mpi-single-single-with-gpu.log
 +Performance: 56.660 ns/day, 0.424 hours/ns, 655.793 timesteps/s
 +
 +</code>
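 +
 +On the OpenMP question from the bullets above, a purely speculative sketch; it assumes the binary was also built with the USER-OMP package, which has not been verified, and it has not been run:
 +
 +<code>
 +# speculative: gpu-accelerated styles plus OpenMP threads for the remaining CPU work
 +# requires a build with both the GPU and USER-OMP packages (assumption)
 +export OMP_NUM_THREADS=4
 +/usr/local/mpich-3.1.4/bin/mpirun -launcher ssh -f ./hostfile -n 4 \
 +/usr/local/lammps-11Aug17/lmp_mpi-double-double-with-gpu \
 +-suffix hybrid gpu omp -pk gpu 1 \
 +-in nvt.in -var t 310 > /dev/null 2>&1; grep ^Performance /tmp/lmp-run.log
 +</code>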
 +
 +==== FSL ====
 +
 +**User Time Reported** from the time command (a timing sketch follows the results below)
 +
 +  * mwgpu cpu run
 +  * 2013 model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
 +    * All tests 45m
 +    * Bft test 16m28s (bedpostx)
 +
 +  * amber128 cpu run
 +  * 2017 model name      : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
 +    * All tests 17m - 2.5x faster
 +    * Bft test 3m39s - 6x faster (bedpostx)
 +
 +  * amber128 gpu run
 +  * 2017 CUDA Device Name: GeForce GTX 1080 Ti
 +    * Bft gpu test 0m1.881s (what!? from command line) - 116x faster (bedpostx_gpu)
 +    * Bft gpu test 0m1.850s (what!? via scheduler) - 118x faster (bedpostx_gpu)
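 +
 +The timings above come from wrapping the FSL commands in time. A minimal sketch; the subject directory path is hypothetical:
 +
 +<code>
 +# hypothetical subject dir containing data.nii.gz, bvecs, bvals, nodif_brain_mask
 +# CPU build
 +time bedpostx /home/hmeij/fsl/subject1
 +# GPU build on the amber128 node
 +time bedpostx_gpu /home/hmeij/fsl/subject1
 +</code>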
 +
 +
 +==== FreeSurfer ====
 +
 +
 +  * http://freesurfer.net/fswiki/DownloadAndInstall#TestyourFreeSurferInstallation
 +  * Example using sample-001.mgz (a recon-all timing sketch follows the output below)
 +
 +<code>
 +
 +Node n37 (mwgpu cpu run)
 +(2013) Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
 +recon-all -s bert finished without error
 +example 1 user    0m3.516s
 +example 2 user    893m1.761s ~15 hours
 +example 3 user    ???m       ~15 hours (estimated)
 +
 +Node n78 (amber128 cpu run)
 +(2017) Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
 +recon-all -s bert finished without error
 +example 1 user    0m2.315s
 +example 2 user    488m49.215s ~8 hours
 +example 3 user    478m44.622s ~8 hours
 +
 +
 +freeview -v \
 +    bert/mri/T1.mgz \
 +    bert/mri/wm.mgz \
 +    bert/mri/brainmask.mgz \
 +    bert/mri/aseg.mgz:colormap=lut:opacity=0.2 \
 +    -f \
 +    bert/surf/lh.white:edgecolor=blue \
 +    bert/surf/lh.pial:edgecolor=red \
 +    bert/surf/rh.white:edgecolor=blue \
 +    bert/surf/rh.pial:edgecolor=red
 +
 +
 +</code>
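 +
 +A sketch of the kind of command that produced the recon-all timings above; the exact invocation is on the FreeSurfer test page linked above, so treat the flags here as an assumption:
 +
 +<code>
 +# assumes the FreeSurfer environment is sourced and sample-001.mgz sits in the working dir
 +export SUBJECTS_DIR=$PWD
 +time recon-all -s bert -i sample-001.mgz -all
 +</code>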
 +
 +Development code for the GPU http://surfer.nmr.mgh.harvard.edu/fswiki/freesurfer_linux_developers_page
 + 
 \\
 **[[cluster:0|Back]]**