This shows you the differences between two versions of the page.
cluster:208 [2022/05/27 13:05] hmeij07 [gpu testing] |
cluster:208 [2022/06/03 12:30] hmeij07 [gpu testing] |
||
---|---|---|---|
Line 385: | Line 385: | ||
===== gpu testing ===== | ===== gpu testing ===== | ||
- | * test slurm v 21.08.1 | + | * test standalone |
* n33-n37 each: 4 gpus, 16 cores, 16 threads, 32 cpus | * n33-n37 each: 4 gpus, 16 cores, 16 threads, 32 cpus | ||
* submit one at a time, observe | * submit one at a time, observe | ||
Line 399: | Line 399: | ||
* while submitting 34 jobs, one at a time (30s delay), slurm fills up n33 first (all on gpu 0) | * while submitting 34 jobs, one at a time (30s delay), slurm fills up n33 first (all on gpu 0) | ||
* 17th submit goes to n34, gpu 1 (weird), n33 state=alloc, | * 17th submit goes to n34, gpu 1 (weird), n33 state=alloc, | ||
+ | * 33rd job, " | ||
+ | * 34th job, " | ||
+ | * all n33 and n34 jobs on single gpu without cuda_visible set | ||
+ | * how that works with gpu util at 100% with one job is beyond me | ||
+ | * do all 16 jobs log the same wall time? Yes, between 10.10 and 10.70 hours. | ||
+ | |||
+ | * ohpc v2.4 slurm v 20.11.8 | ||
+ | * part=test, n 1, B 1:1:1, cuda_visible=0, | ||
+ | * hit a bug, you must specify cpus-per-gpu **and** mem-per-gpu | ||
+ | * then slurm detects 4 gpus on allocated node and allows 4 jobs on a single allocated gpu | ||
+ | * twisted logic | ||
+ | * so recent openhpc version but old slurm version in software stack | ||
+ | * trying standalone install on openhpc prod cluster - auth/munge error, no go | ||
+ | * do all 4 jobs have similar wall time? Yes on n100 varies from 0.6 to 0.7 hours | ||
+ | |||
+ | * ohpc v2.4 slurm v 20.11.8 | ||
+ | * part=test, n 1, B 1:1:1, cuda_visible=0, | ||
+ | * same as above but all 16 jobs run on gpu 0 | ||
+ | * so the limit to 4 jobs on rtx5000 gpu is a hardware phenomenon? | ||
+ | * all 16 jobs finished, wall times of 3.11 to 3.60 hours | ||
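
The ohpc v2.4 / slurm v 20.11.8 notes above say a job must specify both cpus-per-gpu and mem-per-gpu. A minimal sketch of assembling such a submit command, assuming the notes' `part`, `n` and `B` map to sbatch's `--partition`, `--ntasks` and `--extra-node-info`. The helper name `build_sbatch_args`, the script name `run.sh`, the `--gres=gpu:1` request and the values `1` and `7168M` are all hypothetical; the values actually used in the tests are not recorded here.

```python
# Sketch only: builds an sbatch argument list, it does not submit anything.
# build_sbatch_args, run.sh, gpu:1, and the cpu/mem values are hypothetical.
def build_sbatch_args(script, cpus_per_gpu, mem_per_gpu,
                      partition="test", ntasks=1,
                      extra_node_info="1:1:1", gres="gpu:1"):
    # Refuse to build a command unless both per-gpu settings are given,
    # mirroring the "cpus-per-gpu and mem-per-gpu" note above.
    if cpus_per_gpu is None or mem_per_gpu is None:
        raise ValueError("specify both cpus_per_gpu and mem_per_gpu")
    return [
        "sbatch",
        f"--partition={partition}",              # part=test
        f"--ntasks={ntasks}",                    # n 1
        f"--extra-node-info={extra_node_info}",  # B 1:1:1
        f"--gres={gres}",                        # assumed gpu request
        f"--cpus-per-gpu={cpus_per_gpu}",
        f"--mem-per-gpu={mem_per_gpu}",
        script,
    ]


args = build_sbatch_args("run.sh", cpus_per_gpu=1, mem_per_gpu="7168M")
print(" ".join(args))
```

The list form can be handed to `subprocess.run(args)` on a submit host; the `cuda_visible=0` setting from the notes would go inside the job script and is not covered by this sketch.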
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
===== Changes ===== | ===== Changes ===== | ||