===== gpu testing =====
  * n33 only, 4 gpus, 16 cores, 16 threads, 32 cpus
  * submit one at a time, observe
  * part=test, n 1, B 1:1:1, cuda_visible=0 (see the first sketch after this list)
  * "
  * all on same gpu
  * part=test, n 1, B 1:1:1, cuda_visible not set, no node specified, n33 only
  * "
  * all gpus used? nope, all on the same one, gpu 0
  * redoing above with a ''
  * even distribution across all gpus (this also explains the 17th submit behavior below)
  * part=test, n 1, B 1:1:1, cuda_visible not set, no node specified, n[33-34] avail
  * while submitting 34 jobs, one at a time (30s delay), slurm fills up n33 first (all on gpu 0); see the submission loop sketch after this list
  * 17th submit goes to n34, gpu 1 (weird), n33 state=alloc,
  * 33rd job, "
  * 34th job, "
  * all n33 and n34 jobs on a single gpu without cuda_visible set
  * how that works with gpu util at 100% from one job is beyond me
  * do all 16 jobs log the same wall time? (see the sacct sketch below)
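
A minimal sketch of the kind of submit script exercised above, assuming a trivial CUDA payload; the ''gpu_burn'' call and the 300-second duration are stand-ins, not the actual test code:

<code bash>
#!/bin/bash
#SBATCH --partition=test
#SBATCH -n 1
#SBATCH -B 1:1:1
#SBATCH --job-name=gputest

# first test: pin every job to gpu 0
export CUDA_VISIBLE_DEVICES=0
# second test: comment out the export above (cuda_visible not set);
# observed result was the same, every job landed on gpu 0

# hypothetical payload, any cuda binary shows up in nvidia-smi
./gpu_burn 300
</code>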
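
The n[33-34] run submitted 34 of these jobs one at a time; a sketch of the loop, assuming the script above is saved as ''run.sh'' (a name picked for illustration):

<code bash>
#!/bin/bash
# submit 34 jobs, one every 30 seconds
for i in $(seq 1 34); do
    sbatch run.sh
    sleep 30
done
</code>

While the loop runs, placement can be watched with standard tools:

<code bash>
# node and state per job
squeue -p test -o "%.10i %.8T %R"
# which gpu each compute process attached to (run on n33 or n34)
nvidia-smi --query-compute-apps=pid,gpu_uuid,used_memory --format=csv
</code>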
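
One way to answer the wall-time question once the jobs finish; a sketch using ''sacct'' (the field list and time window are just a reasonable pick):

<code bash>
# compare elapsed times across the jobs that shared gpu 0
sacct --partition=test --starttime=today \
      --format=JobID,NodeList,Elapsed,State
</code>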
===== Changes =====