===== gpu testing =====
  * n33 only, 4 gpus, 16 cores, 16 threads, 32 cpus
  * submit one at a time, observe
  * part=test, n 1, B 1:1:1, cuda_visible=0 (see the first sketch after this list)
  * "
  * all on same gpu
  * part=test, n 1, B 1:1:1, cuda_visible not set, no node specified, n33 only
  * "
  * all gpus used? nope, all on the same one, gpu 0
  * redoing above with a ''
  * even distribution across all gpus (this also explains the 17th submit behavior below)
  * part=test, n 1, B 1:1:1, cuda_visible not set, no node specified, n[33-34] avail
  * while submitting 34 jobs, one at a time (30s delay), slurm fills up n33 first (all on gpu 0); see the submission loop sketch after this list
  * 17th submit goes to n34, gpu 1 (weird), n33 state=alloc,
  * 33rd job, "
  * 34th job, "
  * all n33 and n34 jobs on a single gpu without cuda_visible set
  * how that works with gpu util at 100% from one job is beyond me
  * do all 16 jobs log the same wall time? (see the sacct sketch below)
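
A minimal sketch of the kind of submit script exercised above, assuming a trivial CUDA payload; the ''gpu_burn'' call and the 300-second duration are stand-ins, not the actual test code:

<code bash>
#!/bin/bash
#SBATCH --partition=test
#SBATCH -n 1
#SBATCH -B 1:1:1
#SBATCH --job-name=gputest

# first test: pin every job to gpu 0
export CUDA_VISIBLE_DEVICES=0
# second test: comment out the export above (cuda_visible not set);
# observed result was the same, every job landed on gpu 0

# hypothetical payload, any cuda binary shows up in nvidia-smi
./gpu_burn 300
</code>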
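
The n[33-34] run submitted 34 of these jobs one at a time; a sketch of the loop, assuming the script above is saved as ''run.sh'' (a name picked for illustration):

<code bash>
#!/bin/bash
# submit 34 jobs, one every 30 seconds
for i in $(seq 1 34); do
    sbatch run.sh
    sleep 30
done
</code>

While the loop runs, placement can be watched with standard tools:

<code bash>
# node and state per job
squeue -p test -o "%.10i %.8T %R"
# which gpu each compute process attached to (run on n33 or n34)
nvidia-smi --query-compute-apps=pid,gpu_uuid,used_memory --format=csv
</code>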
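
One way to answer the wall-time question once the jobs finish; a sketch using ''sacct'' (the field list and time window are just a reasonable pick):

<code bash>
# compare elapsed times across the jobs that shared gpu 0
sacct --partition=test --starttime=today \
      --format=JobID,NodeList,Elapsed,State
</code>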
===== Changes =====