==== Bench ====
  * Amber 16. The Nucleosome bench runs 4.5x faster than on a K20
  * Not sure it is representative of our workload
  * Adding more MPI threads decreases performance
  * Running across more GPUs (2 or 4) decreases performance
  * One Amber process per MPI thread per GPU is optimal

**Wow, I just realized the most important metric: our K20 setup has a job throughput of 20 per unit of time. The amber128 queue will have a throughput of 4*4.5, or 18, per the same unit of time. One new server matches five old ones (well, purchased in 2013), from an Amber-only perspective.**
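A back-of-the-envelope restatement of that arithmetic (the 20 K20 GPUs, presumably five 4-GPU nodes, and the 4.5x factor come from the bench above; units are relative):

<code>
# k20 queue:  20 gpus x 1.0 (baseline speed) = 20 jobs per unit of time
# amber128:    4 gpus x 4.5 (bench speedup)  = 18 jobs per unit of time
echo "20 * 1.0" | bc
echo "4 * 4.5"  | bc
</code>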
+ | |||
+ | < | ||
+ | |||
nvidia-smi -pm 0; nvidia-smi -c 0   # persistence mode off, compute mode DEFAULT (shared)
# gpu_id is done via CUDA_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=$STRING_2
# on n78
.../mpirun ... -n $STRING_1 $AMBERHOME/... \
  -p prmtop -c inpcrd -ref inpcrd ; grep '...'
# on n34
.../mpirun ... -np $STRING_1 ...

Nucleosome Metric: ns/day, seconds/...
+ | |||
+ | |||
+ | GTX on n78 | ||
+ | |||
+ | -n 1, -gpu_id 0 | ||
+ | | | ||
+ | -n 2, -gpu_id 0 | ||
+ | | | ||
+ | -n 4, -gpu_id 0 | ||
+ | | | ||
+ | -n 4, -gpu_id 01 | ||
+ | | | ||
+ | -n 8, -gpu_id 01 | ||
+ | | | ||
+ | -n 4, -gpu_id 0123 | ||
+ | | | ||
+ | -n 8, -gpu_id 0123 | ||
+ | | | ||
+ | |||
+ | |||
+ | K20 on n34 | ||
+ | |||
+ | -n 1, -gpu_id 0 | ||
+ | | | ||
+ | -n 4, -gpu_id 0 | ||
+ | | | ||
+ | -n4, -gpuid 0123 | ||
+ | | | ||
+ | |||
+ | |||
+ | |||
+ | </ | ||
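Since the wiki rendering truncates the long command lines above, here is a sketch of what a single-GPU run of this bench might look like (the binary path and the ''mdin'' input name are assumptions, not the exact script used):

<code>
# pin the run to one GPU; one MPI rank per GPU was optimal per the notes above
export CUDA_VISIBLE_DEVICES=0
mpirun -n 1 $AMBERHOME/bin/pmemd.cuda.MPI -O \
  -i mdin -o mdout -p prmtop -c inpcrd -ref inpcrd
# the final timing section of mdout reports the ns/day metric
grep 'ns/day' mdout
</code>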
  * Gromacs 5.1.4. My (Colin's) ...

<code>
Mapping of GPU IDs to the 16 PP ranks in this node: 0,...
Performance: ...
+ | |||
+ | # UPDATE Gromacs 2018, check out these new performance stats for -n 4, -gpu=4 | ||
+ | |||
+ | # K20, redone with cuda 9 | ||
+ | |||
+ | root@cottontail gpu]# egrep ' | ||
+ | 01/ | ||
+ | 01/ | ||
+ | 02/ | ||
+ | 02/ | ||
+ | 03/ | ||
+ | 03/ | ||
+ | 04/ | ||
+ | 04/ | ||
+ | |||
+ | # GTX1080 cuda 8 | ||
+ | |||
+ | [hmeij@cottontail gpu]$ egrep ' | ||
+ | 01/ | ||
+ | 01/ | ||
+ | 02/ | ||
+ | 02/ | ||
+ | 03/ | ||
+ | 03/ | ||
+ | 04/ | ||
+ | 04/ | ||
+ | |||
+ | Almost 900 ns/day for a single server. | ||
</code>
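For context, a Gromacs 2018 run in the spirit of the ''-n 4, -gpu=4'' note would look roughly like this sketch (the ''md'' input name and the thread counts are assumptions, not the actual bench script):

<code>
# 4 thread-MPI ranks mapped onto the 4 GPUs
gmx mdrun -deffnm md -ntmpi 4 -ntomp 8 -gpu_id 0123
# the end of the log carries the ns/day figure quoted above
grep 'Performance:' md.log
</code>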
  * Lammps
    * used the colloid example, not sure if that is a good example
    * like Gromacs, lots of room for improvement
    * used the double-double binary, ...
    * the single-double binary might run faster?
<code>
nvidia-smi -pm 0; nvidia-smi -c 0   # persistence mode off, compute mode DEFAULT (shared)
# gpu_id is done via CUDA_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=$STRING_2
# on n78
.../mpirun ... \
  .../lmp_mpi-double-double-with-gpu ... \
  $STRING_3 -in in.colloid > /tmp/out
# on n34
.../mpirun ... \
  -hostfile /... \
  .../lmp_mpi-double-double-with-gpu ... \
  -suffix gpu $STRING_3 ...

Created 5625 atoms
-n 1, -gpu_id 0
Performance: ...
-n 2, -gpu_id ...
Performance: ...
-n 4, -gpu_id ...
Performance: ...
-n 4, -gpu_id 01, -pk gpu 2
Performance: ...
-n 8, -gpu_id 01, -pk gpu 2
Performance: ...
-n 6, -gpu_id 0123, -pk gpu 4
Performance: ...
-n 8, -gpu_id 0123, -pk gpu 4
Performance: ...
-n 16, -gpu_id 0123, -pk gpu 4
Performance: ...
K20 on n34

-n 8, -gpu_id 0123, -pk gpu 4
Performance: ...

GTX on n78 again
-n 8, -gpu_id 0123, -pk gpu 4

Created 22500 atoms
Performance: ...
Created 90000 atoms
Performance: ...
</code>
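Reassembling the truncated lines above, a whole-node colloid run would look roughly like this (the mpirun and binary paths are placeholders for the truncated ones):

<code>
# 8 MPI ranks sharing all 4 GPUs through the LAMMPS GPU package
export CUDA_VISIBLE_DEVICES=0,1,2,3
mpirun -n 8 .../lmp_mpi-double-double-with-gpu \
  -suffix gpu -pk gpu 4 -in in.colloid > /tmp/out
grep tau /tmp/out   # colloid runs in LJ units, so speed is reported in tau/day
</code>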
==== PMMA Bench ====

  * Runs fastest when constrained to one GPU with 4 MPI threads
  * Room for improvement, as GPU and GPU memory are not fully utilized
  * Adding MPI threads or more GPUs reduces ns/day performance
  * No idea if adding OMP threads shows a different picture
  * No idea how it compares to K20 GPUs

<code>
nvidia-smi -pm 0; nvidia-smi -c 0
# gpu_id is done via CUDA_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=[0,...]

# on n78
cd /...
rm -f /...
time .../mpirun ... \
  .../lmp_mpi-double-double-with-gpu ... \
  -in nvt.in -var t 310 > /dev/null 2>&1


PMMA Benchmark. Performance metric: ns/day (times the number of GPUs for whole-node output)

Lammps 11Aug17 on GTX1080Ti (n78)
# per-GPU lines below: index, name, temp(?), memory used, memory free, gpu util, mem util

-n 1, -gpu_id 3
Performance: ...
3, GeForce GTX 1080 Ti, 38, 219 MiB, 10953 MiB, 30 %, 1 %
-n 2, -gpu_id 3
Performance: ...
3, GeForce GTX 1080 Ti, 57, 358 MiB, 10814 MiB, 47 %, 3 %
-n 4, -gpu_id 3
Performance: ...
3, GeForce GTX 1080 Ti, 59, 690 MiB, 10482 MiB, 76 %, 4 %
-n 8, -gpu_id 3
Performance: ...
3, GeForce GTX 1080 Ti, 47, 1332 MiB, 9840 MiB, 90 %, 4 %
-n 4, -gpu_id 01
Performance: ...
0, GeForce GTX 1080 Ti, 48, 350 MiB, 10822 MiB, 50 %, 3 %
1, GeForce GTX 1080 Ti, 37, 344 MiB, 10828 MiB, 49 %, 3 %
-n 8, -gpu_id 01
Performance: ...
0, GeForce GTX 1080 Ti, 66, 670 MiB, 10502 MiB, 77 %, 4 %
1, GeForce GTX 1080 Ti, 51, 670 MiB, 10502 MiB, 81 %, 4 %
-n 12, -gpu_id 01
Performance: ...
0, GeForce GTX 1080 Ti, 65, 988 MiB, 10184 MiB, 82 %, 4 %
1, GeForce GTX 1080 Ti, 50, 990 MiB, 10182 MiB, 85 %, 4 %
-n 8, -gpu_id 0123
Performance: ...
0, GeForce GTX 1080 Ti, 56, 340 MiB, 10832 MiB, 57 %, 3 %
1, GeForce GTX 1080 Ti, 41, 340 MiB, 10832 MiB, 52 %, 2 %
2, GeForce GTX 1080 Ti, 43, 340 MiB, 10832 MiB, 57 %, 3 %
3, GeForce GTX 1080 Ti, 42, 340 MiB, 10832 MiB, 55 %, 2 %
-n 12, -gpu_id 0123
Performance: ...
-n 16
Performance: ...

+ | |||
+ | # on n34 | ||
+ | unable to get it to run... | ||
+ | |||
+ | K20 on n34 | ||
+ | |||
+ | -n 1, -gpu_id 0 | ||
+ | -n 4, -gpu_id 0 | ||
+ | -n 4, -gpuid 0123 | ||
+ | |||
# comparison of binaries running PMMA
# 1 gpu, 4 mpi threads each run

# lmp_mpi-double-double-with-gpu.log
Performance: ...
# lmp_mpi-single-double-with-gpu.log
Performance: ...
# lmp_mpi-single-single-with-gpu.log
Performance: ...

</code>

==== FSL ====

**User Time Reported** from the time command (a timing sketch follows the list)

  * mwgpu cpu run
    * 2013 model name: Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
    * All tests 45m
    * Bft test 16m28s (bedpostx)

  * amber128 cpu run
    * 2017 model name: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
    * All tests 17m - 2.5x faster
    * Bft test 3m39s - 6x faster (bedpostx)

  * amber128 gpu run
    * 2017 CUDA Device Name: GeForce GTX 1080 Ti
    * Bft gpu test 0m1.881s (what!? from command line) - 116x faster (bedpostx_gpu)
    * Bft gpu test 0m1.850s (what!? via scheduler) - 118x faster (bedpostx_gpu)

==== FreeSurfer ====

  * http://...
  * Example using sample-001.mgz

<code>
+ | |||
+ | Node n37 (mwgpu cpu run) | ||
+ | (2013) Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz | ||
+ | recon-all -s bert finished without error | ||
+ | example 1 user 0m3.516s | ||
+ | example 2 user 893m1.761s ~15 hours | ||
+ | example 3 user ???m ~15 hours (estimated) | ||
+ | |||
+ | Node n78 (amber128 cpu run) | ||
+ | (2017) Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz | ||
+ | recon-all -s bert finished without error | ||
+ | example 1 user 0m2.315s | ||
+ | example 2 user 488m49.215s ~8 hours | ||
+ | example 3 user 478m44.622s ~8 hours | ||
+ | |||
+ | |||
+ | freeview -v \ | ||
+ | bert/ | ||
+ | bert/ | ||
+ | bert/ | ||
+ | bert/ | ||
+ | -f \ | ||
+ | bert/ | ||
+ | bert/ | ||
+ | bert/ | ||
+ | bert/ | ||
+ | |||
+ | |||
+ | </ | ||
+ | |||
Development code for the GPU: http://...

\\
**[[cluster:...]]**