  
===== Changes =====

** OverSubscribe **
  
Suggestion was made to set ''OverSubscribe=NO'' for all partitions (thanks, Colin). With a simple sleep script we now observe that we can run 16 jobs simultaneously (with either ''-n'' or ''-B''). That is 16 physical cores, each with one logical core (thread), for a total of 32 cpus on ''n37''.
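A minimal sketch of such a test (the script name, partition, and sleep duration are illustrative assumptions, not the exact script used):

<code>
#!/bin/bash
# sleep.sh -- hypothetical test job; partition and duration are assumptions
#SBATCH --job-name=sleep-test
#SBATCH --partition=mwgpu
#SBATCH -n 1
sleep 300
</code>

Submitting it more times than there are physical cores, for example ''for i in $(seq 1 32); do sbatch sleep.sh; done'', and then watching ''squeue'' should show no more than 16 jobs running at once on ''n37''.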
--- //[[hmeij@wesleyan.edu|Henk]] 2021/10/15 15:18//
  
** GPU-CPU cores **

Noticed this with the debug level on in ''slurmd.log'':
<code>
# n37: old gpu model, bound to all physical cpu cores
GRES[gpu] Type:tesla_k20m Count:1 Cores(32):0-15  Links:-1,0,0,0 /dev/nvidia0
GRES[gpu] Type:tesla_k20m Count:1 Cores(32):0-15  Links:0,-1,0,0 /dev/nvidia1
GRES[gpu] Type:tesla_k20m Count:1 Cores(32):0-15  Links:0,0,-1,0 /dev/nvidia2
GRES[gpu] Type:tesla_k20m Count:1 Cores(32):0-15  Links:0,0,0,-1 /dev/nvidia3

# n78: somewhat dated gpu model, bound to top/bottom halves of the physical cores (16)
GRES[gpu] Type:geforce_gtx_1080_ti Count:1 Cores(32):0-7   Links:-1,0,0,0 /dev/nvidia0
GRES[gpu] Type:geforce_gtx_1080_ti Count:1 Cores(32):0-7   Links:0,-1,0,0 /dev/nvidia1
GRES[gpu] Type:geforce_gtx_1080_ti Count:1 Cores(32):8-15  Links:0,0,-1,0 /dev/nvidia2
GRES[gpu] Type:geforce_gtx_1080_ti Count:1 Cores(32):8-15  Links:0,0,0,-1 /dev/nvidia3

# n79: more recent gpu model, same top/bottom binding pattern (24)
GRES[gpu] Type:geforce_rtx_2080_s Count:1 Cores(48):0-11   Links:-1,0,0,0 /dev/nvidia0
GRES[gpu] Type:geforce_rtx_2080_s Count:1 Cores(48):0-11   Links:0,-1,0,0 /dev/nvidia1
GRES[gpu] Type:geforce_rtx_2080_s Count:1 Cores(48):12-23  Links:0,0,-1,0 /dev/nvidia2
GRES[gpu] Type:geforce_rtx_2080_s Count:1 Cores(48):12-23  Links:0,0,0,-1 /dev/nvidia3
</code>
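The same bindings can be checked on a node without log diving; assuming ''slurmd'' and ''nvidia-smi'' are in the PATH there, something like:

<code>
# print the gres configuration slurmd discovers on this node
slurmd -G

# cross-check gpu/cpu affinity as reported by the driver
nvidia-smi topo -m
</code>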

** Weight Priority **

Weight nodes by the memory per logical core: jobs will be allocated the nodes with the lowest weight that satisfies their requirements. CPU jobs will therefore be routed to the gpu queues last, since those carry the highest weight (= lowest priority). A ''slurm.conf'' sketch follows the list below.

<code>
# queue: memory (GB) / logical cores = node weight
hp12:      12/8  = 1.5
tinymem:   32/20 = 1.6
mw128:    128/24 = 5.333333
mw256:    256/16 = 16

exx96:     96/24 = 4
amber128: 128/16 = 8
mwgpu:    256/16 = 16
</code>
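In ''slurm.conf'' these ratios would go on the node definitions via ''Weight=''. Slurm only accepts integer weights, so the values above would have to be scaled (say, x10) and rounded; the node names and specs below are placeholders, not our actual config:

<code>
# hypothetical slurm.conf excerpt -- names/specs are placeholders;
# weights are the ratios above, scaled x10 and rounded
NodeName=hp12-n[1-8]  CPUs=8  RealMemory=12000  Weight=15
NodeName=tiny-n[1-8]  CPUs=20 RealMemory=32000  Weight=16
NodeName=mw128-n[1-8] CPUs=24 RealMemory=128000 Weight=53
NodeName=exx96-n[1-4] CPUs=24 RealMemory=96000  Weight=40
NodeName=mwgpu-n[1-4] CPUs=16 RealMemory=256000 Weight=160
</code>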
\\
**[[cluster:0|Back]]**
  