  
===== Changes =====

** OverSubscribe **
  
Suggestion was made to set ''OverSubscribe=NO'' for all partitions (thanks, Colin). With a simple sleep script we now observe that we can run 16 jobs simultaneously (with either ''-n'' or ''-B''). That is 16 physical cores, each with one logical core (thread), for a total of 32 cpus on ''n37''.
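A minimal sketch of such a test (the script name, partition, and sleep duration are illustrative assumptions, not the exact script used):

<code>
#!/bin/bash
# sleep.sh -- hypothetical test job; partition and duration are assumptions
#SBATCH --job-name=sleep-test
#SBATCH --partition=mwgpu
#SBATCH -n 1
sleep 300
</code>

Submitting it more times than there are physical cores, for example ''for i in $(seq 1 32); do sbatch sleep.sh; done'', and then watching ''squeue'' should show no more than 16 jobs running at once on ''n37''.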
--- //[[hmeij@wesleyan.edu|Henk]] 2021/10/15 15:18//
  
** GPU-CPU cores **

Noticed this with the debug level on in ''slurmd.log'':
<code>
# n37: old gpu model, bound to all physical cpu cores
GRES[gpu] Type:tesla_k20m Count:1 Cores(32):0-15  Links:-1,0,0,0 /dev/nvidia0
GRES[gpu] Type:tesla_k20m Count:1 Cores(32):0-15  Links:0,-1,0,0 /dev/nvidia1
GRES[gpu] Type:tesla_k20m Count:1 Cores(32):0-15  Links:0,0,-1,0 /dev/nvidia2
GRES[gpu] Type:tesla_k20m Count:1 Cores(32):0-15  Links:0,0,0,-1 /dev/nvidia3

# n78: somewhat dated gpu model, bound to top/bottom halves of the physical cores (16)
GRES[gpu] Type:geforce_gtx_1080_ti Count:1 Cores(32):0-7   Links:-1,0,0,0 /dev/nvidia0
GRES[gpu] Type:geforce_gtx_1080_ti Count:1 Cores(32):0-7   Links:0,-1,0,0 /dev/nvidia1
GRES[gpu] Type:geforce_gtx_1080_ti Count:1 Cores(32):8-15  Links:0,0,-1,0 /dev/nvidia2
GRES[gpu] Type:geforce_gtx_1080_ti Count:1 Cores(32):8-15  Links:0,0,0,-1 /dev/nvidia3

# n79: more recent gpu model, same top/bottom binding pattern (24)
GRES[gpu] Type:geforce_rtx_2080_s Count:1 Cores(48):0-11   Links:-1,0,0,0 /dev/nvidia0
GRES[gpu] Type:geforce_rtx_2080_s Count:1 Cores(48):0-11   Links:0,-1,0,0 /dev/nvidia1
GRES[gpu] Type:geforce_rtx_2080_s Count:1 Cores(48):12-23  Links:0,0,-1,0 /dev/nvidia2
GRES[gpu] Type:geforce_rtx_2080_s Count:1 Cores(48):12-23  Links:0,0,0,-1 /dev/nvidia3
</code>
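The same bindings can be checked on a node without log diving; assuming ''slurmd'' and ''nvidia-smi'' are in the PATH there, something like:

<code>
# print the gres configuration slurmd discovers on this node
slurmd -G

# cross-check gpu/cpu affinity as reported by the driver
nvidia-smi topo -m
</code>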

** Weight Priority **

Weight nodes by the memory per logical core: jobs will be allocated the nodes with the lowest weight that satisfies their requirements. CPU jobs will therefore be routed to the gpu queues last, since those carry the highest weight (= lowest priority). A ''slurm.conf'' sketch follows the list below.

<code>
# queue: memory (GB) / logical cores = node weight
hp12:      12/8  = 1.5
tinymem:   32/20 = 1.6
mw128:    128/24 = 5.333333
mw256:    256/16 = 16

exx96:     96/24 = 4
amber128: 128/16 = 8
mwgpu:    256/16 = 16
</code>
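In ''slurm.conf'' these ratios would go on the node definitions via ''Weight=''. Slurm only accepts integer weights, so the values above would have to be scaled (say, x10) and rounded; the node names and specs below are placeholders, not our actual config:

<code>
# hypothetical slurm.conf excerpt -- names/specs are placeholders;
# weights are the ratios above, scaled x10 and rounded
NodeName=hp12-n[1-8]  CPUs=8  RealMemory=12000  Weight=15
NodeName=tiny-n[1-8]  CPUs=20 RealMemory=32000  Weight=16
NodeName=mw128-n[1-8] CPUs=24 RealMemory=128000 Weight=53
NodeName=exx96-n[1-4] CPUs=24 RealMemory=96000  Weight=40
NodeName=mwgpu-n[1-4] CPUs=16 RealMemory=256000 Weight=160
</code>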
\\
**[[cluster:0|Back]]**
  