2, GeForce GTX 1080 Ti, 66, 352 MiB, 10820 MiB, 56 %, 1 %
3, GeForce GTX 1080 Ti, 63, 352 MiB, 10820 MiB, 57 %, 1 %

# note, from nvidia-smi --help-query-gpu
"temperature.gpu"
Core GPU temperature. in degrees C.

From vendor "..."

</code>
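The CSV lines above can be produced with a query along these lines (a sketch, not necessarily the exact command used; all of these property names are listed by ''nvidia-smi --help-query-gpu''):

<code>
# one CSV line per gpu: index, name, temp C, memory used/total, utilization
nvidia-smi --format=csv,noheader \
  --query-gpu=index,name,temperature.gpu,memory.used,memory.total,utilization.gpu,utilization.memory
</code>
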
==== Bench ====

  * Amber 16. Nucleosome bench runs 4.5x faster than on a K20
    * Not sure it is representative of our workload
    * Adding more MPI threads decreases performance
    * Running across more gpus (2 or 4) decreases performance
    * One Amber process with one MPI thread per GPU is optimal

**Wow, I just realized the most important metric: our K20 pool has a job throughput of 20 per unit of time. The amber128 queue will have a throughput of 4*4.5, or 18, per the same unit of time. So one new server matches five old ones (well, ones purchased in 2013), from an Amber-only perspective.**

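Spelled out, the arithmetic behind that claim (assuming the K20 pool is the five old servers at 4 gpus each, and the 4.5x speedup measured above):

<code>
# K20 pool : 5 nodes x 4 gpus x 1.0 (baseline)         = 20 job-units
# amber128 : 1 node  x 4 gpus x 4.5 (measured speedup) = 18 job-units
echo "5*4*1.0; 1*4*4.5" | bc    # prints 20.0 and 18.0
</code>
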
<code>

nvidia-smi -pm 0; nvidia-smi -c 0
# gpu_id is done via CUDA_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=$STRING_2
# on n78
/
-n $STRING_1 $AMBERHOME/
-p prmtop -c inpcrd -ref inpcrd ; grep '
# on n34
/
-np $STRING_1


Nucleosome Metric ns/day, seconds/...


GTX on n78

-n 1, -gpu_id 0
-n 2, -gpu_id 0
-n 4, -gpu_id 0
-n 4, -gpu_id 01
-n 8, -gpu_id 01
-n 4, -gpu_id 0123
-n 8, -gpu_id 0123


K20 on n34

-n 1, -gpu_id 0
-n 4, -gpu_id 0
-n 4, -gpu_id 0123

</code>
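A minimal sketch of the pmemd launch above in full (the mpich path, input file name, and the ''mdinfo'' grep target are assumptions, not the exact originals):

<code>
# hypothetical n78 launch: $STRING_1 MPI ranks on the gpu(s) in $STRING_2
export CUDA_VISIBLE_DEVICES=0
/usr/local/mpich/bin/mpirun -n 1 \
  $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin \
  -p prmtop -c inpcrd -ref inpcrd
grep 'ns/day' mdinfo    # the benchmark metric
</code>
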
  * Gromacs 5.1.4 My (Colin's) sample script ...
    * 4 multidirs on 4 gpus achieves sweet spot at roughly 350 ns/day

<code>

# about 20 mins per run
/
gmx_mpi mdrun -nsteps 600000 $STRING_2 -gpu_id $STRING_3 \
-ntmpi 0 -npme 0 -s topol.tpr -ntomp 0 -pin on -nb gpu

# Gromacs seems to have a mind of its own
On host n78 4 GPUs user-selected for this run.
Mapping of GPU IDs to the 4 PP ranks in this node: 0,1,2,3 (-n<...)

Metric: ns/day

-n 1, -multidir 01, -gpu_id 0
Using 4 MPI processes
Using 8 OpenMP threads per MPI process
Performance: ...

-n 2, -multidir 01 02, -gpu_id 01
Using 2 MPI processes
Using 8 OpenMP threads per MPI process
Performance: ...

-n 4, -multidir 01 02 03 04, -gpu_id 0123
Using 1 MPI process
Using 8 OpenMP threads
Performance: ...

-n 8, -multidir 01 02 03 04 05 06 07 08, -gpu_id 00112233
Using 1 MPI process
Using 4 OpenMP threads
cudaMallocHost of size 1024128 bytes failed: all CUDA-capable devices are busy or unavailable
Ahh, nvidia compute modes need to be -pm 0 & -c 0 for gromacs ...
NOTE: The GPU has >25% less load than the CPU. This imbalance causes performance loss.
Performance: ...

-n 16 (max physical cpu cores), -multidir 01 02 ... 15 16, -gpu_id 0000111122223333
Using 1 MPI process
Using 2 OpenMP threads
Mapping of GPU IDs to the 16 PP ranks in this node: 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
Performance: ...

# UPDATE Gromacs 2018, check out these new performance stats for -n 4, -gpu=4

# K20, redone with cuda 9

[root@cottontail gpu]# egrep '...
01/...
01/...
02/...
02/...
03/...
03/...
04/...
04/...

# GTX1080 cuda 8

[hmeij@cottontail gpu]$ egrep '...
01/...
01/...
02/...
02/...
03/...
03/...
04/...
04/...

Almost 900 ns/day for a single server.

</code>
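For the record, a full launch of the 4-gpu sweet-spot case above looks roughly like this (the bare ''mpirun'' is an assumption; the mdrun flags are the ones shown in the block):

<code>
# hypothetical 4-gpu, 4-simulation multidir launch
mpirun -n 4 gmx_mpi mdrun -nsteps 600000 \
  -multidir 01 02 03 04 -gpu_id 0123 \
  -ntmpi 0 -npme 0 -s topol.tpr -ntomp 0 -pin on -nb gpu
grep Performance 0*/md.log    # ns/day per member simulation
</code>
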
  * Lammps 11Aug17
    * used the colloid example, not sure if that's a good example
    * like gromacs, lots of room for improvement
    * used the double-double binary (see the precision note after this list)
    * single-double binary might run faster?

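The precision variants come from how the GPU library was built; in the stock lammps-11Aug17 tree that is a Makefile setting along these lines (a sketch, not our exact build files):

<code>
# lib/gpu/Makefile.linux* , pick one precision at build time
CUDA_PRECISION = -D_DOUBLE_DOUBLE    # double-double binary (used above)
#CUDA_PRECISION = -D_SINGLE_DOUBLE   # mixed precision, usually faster
#CUDA_PRECISION = -D_SINGLE_SINGLE   # fastest, least precise
</code>
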
<code>

nvidia-smi -pm 0; nvidia-smi -c 0
# gpu_id is done via CUDA_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=$STRING_2
# on n78
/
/
$STRING_3 -in in.colloid > /tmp/out ; grep tau /tmp/out
# on n34
/
-hostfile /
/
-suffix gpu $STRING_3


Created 5625 atoms

-n 1, -gpu_id 0
Performance: ...
-n 2, -gpu_id 01
Performance: ...
-n 4, -gpu_id 0123
Performance: ...

-n 4, -gpu_id 01, -pk gpu 2
Performance: ...
-n 8, -gpu_id 01, -pk gpu 2
Performance: ...
-n 6, -gpu_id 0123, -pk gpu 4
Performance: ...
-n 8, -gpu_id 0123, -pk gpu 4
Performance: ...
-n 16, -gpu_id 0123, -pk gpu 4
Performance: ...


K20 on n34

-n 8, -gpu_id 0123, -pk gpu 4
Performance: ...


GTX on n78 again
-n 8, -gpu_id 0123, -pk gpu 4

Created 22500 atoms
Performance: ...
Created 90000 atoms
Performance: ...

</code>
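A sketch of a complete n78 launch of this type (the binary path and bare ''mpirun'' are assumptions; ''-suffix gpu'' and ''-pk gpu N'' are the standard GPU package switches seen above):

<code>
# hypothetical: 8 MPI ranks sharing all 4 gpus
export CUDA_VISIBLE_DEVICES=0,1,2,3
mpirun -n 8 lmp_mpi -suffix gpu -pk gpu 4 \
  -in in.colloid > /tmp/out
grep tau /tmp/out    # the Performance line, tau/day and timesteps/s
</code>
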
==== Scripts ====
</code>
==== PMMA Bench ====

  * Runs fastest when constrained to one gpu with 4 mpi threads
  * Room for improvement as gpu and gpu memory are not fully utilized
  * Adding mpi threads or more gpus reduces ns/day performance
  * No idea if adding omp threads shows a different picture (see the sketch after this list)
  * No idea how it compares to K20 gpus

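One way to probe the OpenMP question, assuming the binary was built with the USER-OMP package (untested here, hence hypothetical):

<code>
# 4 MPI ranks x 2 OpenMP threads on one gpu; the hybrid suffix uses
# gpu-accelerated styles first, omp versions where no gpu one exists
export CUDA_VISIBLE_DEVICES=3
export OMP_NUM_THREADS=2
mpirun -np 4 lmp_mpi -sf hybrid gpu omp -pk gpu 1 -pk omp 2 \
  -in nvt.in -var t 310
</code>
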
<code>

nvidia-smi -pm 0; nvidia-smi -c 0
# gpu_id is done via CUDA_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=[0,1,2,3]

# on n78
cd /
rm -f /
time /
/
-in nvt.in -var t 310 > /dev/null 2>&1


PMMA Benchmark Performance Metric ns/day (x nr of gpus for node output)


Lammps 11Aug17 on GTX1080Ti (n78)

-n 1, -gpu_id 3
Performance: ...
3, GeForce GTX 1080 Ti, 38, 219 MiB, 10953 MiB, 30 %, 1 %
-n 2, -gpu_id 3
Performance: ...
3, GeForce GTX 1080 Ti, 57, 358 MiB, 10814 MiB, 47 %, 3 %
-n 4, -gpu_id 3
Performance: ...
3, GeForce GTX 1080 Ti, 59, 690 MiB, 10482 MiB, 76 %, 4 %
-n 8, -gpu_id 3
Performance: ...
3, GeForce GTX 1080 Ti, 47, 1332 MiB, 9840 MiB, 90 %, 4 %
-n 4, -gpu_id 01
Performance: ...
0, GeForce GTX 1080 Ti, 48, 350 MiB, 10822 MiB, 50 %, 3 %
1, GeForce GTX 1080 Ti, 37, 344 MiB, 10828 MiB, 49 %, 3 %
-n 8, -gpu_id 01
Performance: ...
0, GeForce GTX 1080 Ti, 66, 670 MiB, 10502 MiB, 77 %, 4 %
1, GeForce GTX 1080 Ti, 51, 670 MiB, 10502 MiB, 81 %, 4 %
-n 12, -gpu_id 01
Performance: ...
0, GeForce GTX 1080 Ti, 65, 988 MiB, 10184 MiB, 82 %, 4 %
1, GeForce GTX 1080 Ti, 50, 990 MiB, 10182 MiB, 85 %, 4 %
-n 8, -gpu_id 0123
Performance: ...
0, GeForce GTX 1080 Ti, 56, 340 MiB, 10832 MiB, 57 %, 3 %
1, GeForce GTX 1080 Ti, 41, 340 MiB, 10832 MiB, 52 %, 2 %
2, GeForce GTX 1080 Ti, 43, 340 MiB, 10832 MiB, 57 %, 3 %
3, GeForce GTX 1080 Ti, 42, 340 MiB, 10832 MiB, 55 %, 2 %
-n 12, -gpu_id 0123
Performance: ...
-n 16
Performance: ...


# on n34
unable to get it to run...

K20 on n34

-n 1, -gpu_id 0
-n 4, -gpu_id 0
-n 4, -gpu_id 0123

# comparison of binaries running PMMA
# 1 gpu 4 mpi threads each run

# lmp_mpi-double-double-with-gpu.log
Performance: ...
# lmp_mpi-single-double-with-gpu.log
Performance: ...
# lmp_mpi-single-single-with-gpu.log
Performance: ...

</code>
==== FSL ====

**User Time Reported** from the time command (collected as sketched after the list below)

  * mwgpu cpu run
    * 2013 model name : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
    * All tests 45m
    * Bft test 16m28s (bedpostx)

  * amber128 cpu run
    * 2017 model name : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
    * All tests 17m - 2.5x faster
    * Bft test 3m39s - 6x faster (bedpostx)

  * amber128 gpu run
    * 2017 CUDA Device Name: GeForce GTX 1080 Ti
    * Bft gpu test 0m1.881s (what!? from command line) - 116x faster (bedpostx_gpu)
    * Bft gpu test 0m1.850s (what!? via scheduler) - 118x faster (bedpostx_gpu)

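The timings were collected along these lines (a sketch; ''subjdir'' stands in for the bedpostx example dataset directory):

<code>
# cpu version (mwgpu and amber128 runs)
time bedpostx subjdir
# gpu version (amber128 run, CUDA build)
time bedpostx_gpu subjdir
</code>
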
==== FreeSurfer ====

  * http://...
  * Example using sample-001.mgz

<code>

Node n37 (mwgpu cpu run)
(2013) Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
recon-all -s bert finished without error
example 1 user 0m3.516s
example 2 user 893m1.761s ~15 hours
example 3 user ???m ~15 hours (estimated)

Node n78 (amber128 cpu run)
(2017) Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
recon-all -s bert finished without error
example 1 user 0m2.315s
example 2 user 488m49.215s ~8 hours
example 3 user 478m44.622s ~8 hours


freeview -v \
bert/mri/T1.mgz \
bert/mri/wm.mgz \
bert/mri/brainmask.mgz \
bert/mri/aseg.mgz:colormap=lut:opacity=0.2 \
-f \
bert/surf/lh.white:edgecolor=blue \
bert/surf/lh.pial:edgecolor=red \
bert/surf/rh.white:edgecolor=blue \
bert/surf/rh.pial:edgecolor=red

</code>
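For reference, the reconstruction behind those timings is the stock FreeSurfer stream (''-i'' imports the volume, ''-all'' runs every stage):

<code>
# import sample-001.mgz as subject "bert" and run the full pipeline
recon-all -i sample-001.mgz -s bert -all
</code>
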
Development code for the GPU: http://...

\\
**[[cluster:...]]**