On the cpu resource requests: You may request 1 or more nodes, 1 or more sockets per node, 1 or more cores (physical) per socket, or 1 or more threads (logical + physical) per core. Such a request can be fine grained or not; just request a node with ''--exclusive'' (test queue only) or share nodes (other queues, with ''--oversubscribe'').
  
//Note: this oversubscribing is not working yet. I can only get 4 simultaneous jobs running. Maybe there is a conflict with Openlava jobs. Should isolate a node and do further testing. After isolation (n37), 4 jobs with -n 4 exhausts the number of physical cores. Is that why the 5th job goes pending? Solved, see the Changes section.//
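
For illustration, a minimal sketch of the two sharing modes (partition names are simply the queues discussed on this page):

<code>
# whole node to yourself, test queue only
srun --partition=test --exclusive -n 1 sleep 60 &

# share a node with other jobs (other queues)
srun --partition=mwgpu --oversubscribe -n 1 --mem=1024 sleep 60 &
</code>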
  
===== MPI =====
  
 --- //[[hmeij@wesleyan.edu|Henk]] 2021/10/15 09:16//

===== gpu testing =====

  * test standalone slurm v 21.08.1
  * n33-n37 each: 4 gpus, 16 cores, 16 threads, 32 cpus
  * submit one job at a time, observe
  * part=test, n 1, B 1:1:1, cuda_visible=0, no node specified, n33 only
  * "resources" reason at the 17th submit, used up 16 cores and 16 threads
  * all on the same gpu
  * part=test, n 1, B 1:1:1, cuda_visible not set, no node specified, n33 only
  * "resources" reason at the 17th submit too, same reason
  * all gpus used? nope, all on the same one (gpu 0)
  * redoing the above with ''export CUDA_VISIBLE_DEVICES=`shuf -i 0-3 -n 1`'' (a sketch of the job script follows this list)
  * even distribution across all gpus, same reason at the 17th submit
  * part=test, n 1, B 1:1:1, cuda_visible not set, no node specified, n[33-34] avail
  * while submitting 34 jobs, one at a time (30s delay), slurm fills up n33 first (all on gpu 0)
  * 17th submit goes to n34, gpu 1 (weird), n33 state=alloc, n34 state=mix
  * 33rd job, "Resources" reason, job pending
  * 34th job, "Priority" reason (?), job pending
  * all n33 and n34 jobs land on a single gpu when cuda_visible is not set
  * how that works with gpu utilization at 100% with one job is beyond me
  * do all 16 jobs log the same wall time? Yes, between 10.10 and 10.70 hours.
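
A sketch of the per-job submission used in the list above; ''run_gpu_test'' is a placeholder for the actual gpu payload (roughly a 10-hour job), which is not shown here:

<code>
#!/bin/bash
#SBATCH --partition=test
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -B 1:1:1

# pick a random gpu (0-3) so jobs spread across all four devices
export CUDA_VISIBLE_DEVICES=`shuf -i 0-3 -n 1`

# hypothetical payload; substitute the real gpu test program
./run_gpu_test
</code>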

  * ohpc v2.4, slurm v 20.11.8
  * part=test, n 1, B 1:1:1, cuda_visible=0, no node specified, n100 only
  * hit a bug: you must specify cpus-per-gpu **and** mem-per-gpu (see the example after this list)
  * then slurm detects 4 gpus on the allocated node and allows 4 jobs on a single allocated gpu
  * twisted logic
  * so a recent openhpc version but an old slurm version in the software stack
  * trying a standalone install on the openhpc prod cluster - auth/munge error, no go
  * do all 4 jobs have similar wall times? Yes, on n100 they vary from 0.6 to 0.7 hours
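
The workaround for that bug, roughly (flag values are illustrative; ''run_gpu_test'' is again a placeholder):

<code>
# slurm 20.11.8: specify both --cpus-per-gpu and --mem-per-gpu,
# otherwise the gpu request is not handled correctly
srun --partition=test -N 1 -n 1 -B 1:1:1 \
     --gres=gpu:1 --cpus-per-gpu=1 --mem-per-gpu=7168 \
     ./run_gpu_test
</code>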

  * ohpc v2.4, slurm v 20.11.8
  * part=test, n 1, B 1:1:1, cuda_visible=0, no node specified, n78 only
  * same as above but all 16 jobs run on gpu 0
  * so is the limit of 4 jobs on the rtx5000 gpu a hardware phenomenon?
  * all 16 jobs finished, wall times of 3.11 to 3.60 hours

===== gpu testing 2 =====

The newer 2022 version seems to have reversed the override options for oversubscription. So here is our testing... back to CR_CPU_Memory and OverSubscribe=No   --- //[[hmeij@wesleyan.edu|Henk]] 2022/11/02 13:23//

<code>

CR_Socket_Memory
PartitionName=test Nodes=n[100-101]
Default=YES MaxTime=INFINITE State=UP
OverSubscribe=No DefCpuPerGPU=12

MPI jobs with -N 1, -n 8 and -B 2:4:1
no override options, cpus=48
--mem=2048, cpus=48
and --cpus-per-task=1, cpus=48
and --ntasks-per-node=8, cpus=24

MPI jobs with -N 1, -n 8 and -B 1:8:1
--mem=10240, cpus=48
and --cpus-per-task=1, cpus=48
and --ntasks-per-node=8, cpus=24

GPU jobs with -N 1, -n 1 and -B 1:1:1
no override options, no cuda export, cpus=48
--cpus-per-gpu=1, cpus=24
and --mem-per-gpu=7168, cpus=1 (pending
while another gpu job runs in the queue but gpus are free???)

GPU jobs with -N 1, -n 1 and -B 1:1:1
no override options, yes cuda export, cpus=48
--cpus-per-gpu=1, cpus=24
and --mem-per-gpu=7168, cpus=1 (resources pending
while a gpu job runs, gpus are free, then it executes)

...suddenly the cpus=1 turns into cpus=24
when submitting; slurm confused because of all
the job cancellations?

CR_CPU_Memory test=no, mwgpu=force:16
PartitionName=test Nodes=n[100-101]
Default=YES MaxTime=INFINITE State=UP
OverSubscribe=No DefCpuPerGPU=12

MPI jobs with -N 1, -n 8 and -B 2:4:1
no override options, cpus=8 (queue fills across nodes,
but only one job per node, test & mwgpu)
--mem=1024, cpus=8 (queue fills first node ...,
but only three jobs per node, test 3x8=24 full, 4th job pending &
mwgpu 17th job goes pending on n33, overloaded with -n 8 !!)
(not needed) --cpus-per-task=?, cpus=
(not needed) --ntasks-per-node=?, cpus=

GPU jobs with -N 1, -n 1 and -B 1:1:1 on test
no override options, no cuda export, cpus=12 (one gpu per node)
--cpus-per-gpu=1, cpus=1 (one gpu per node)
and --mem-per-gpu=7168, cpus=1 (both override options
required else all mem allocated!, max 4 jobs per node,
fills first node first...cuda export not needed)
with cuda export, same node, same gpu,
with "no" enabled multiple jobs per gpu not accepted

GPU jobs with -N 1, -n 1 and -B 1:1:1 on mwgpu
--cpus-per-gpu=1,
and --mem-per-gpu=7168, cpus=1
(same node, same gpu, cuda export set,
with "force:16" enabled 4 jobs per gpu accepted,
potential for overloading!)

</code>
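
For reference, a minimal sketch of the slurm.conf pieces these tests toggle; the mwgpu line and the node lists are assumptions pieced together from the notes above, not the production file:

<code>
# consume individual cpus and memory when scheduling
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory

# test partition: no oversubscription of resources
PartitionName=test  Nodes=n[100-101] Default=YES MaxTime=INFINITE State=UP OverSubscribe=No DefCpuPerGPU=12

# mwgpu partition: the "force:16" case, up to 16 jobs sharing a resource
PartitionName=mwgpu Nodes=n[33-37] MaxTime=INFINITE State=UP OverSubscribe=FORCE:16
</code>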
  
===== Changes =====

** OverSubscribe **
  
Suggestion was made to set ''OverSubscribe=No'' for all partitions (thanks, Colin). We now observe with a simple sleep script that we can run 16 jobs simultaneously (with either -n or -B). So that's 16 physical cores, each with a logical core (thread), for a total of 32 cpus on ''n37''.
    
 --- //[[hmeij@wesleyan.edu|Henk]] 2021/10/15 15:18//
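
The "simple sleep script" is not reproduced on this page; a minimal sketch that matches the description (once 16 copies are running, the 17th submission should go pending):

<code>
#!/bin/bash
#SBATCH --partition=test
#SBATCH -n 1          # or -B 1:1:1, both behaved the same in this test

sleep 600
</code>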

** GPU-CPU cores **

Noticed this with debug level enabled in slurmd.log. No action taken.

<code>

# n37: old gpu model, bound to all physical cpu cores
GRES[gpu] Type:tesla_k20m Count:1 Cores(32):0-15  Links:-1,0,0,0 /dev/nvidia0
GRES[gpu] Type:tesla_k20m Count:1 Cores(32):0-15  Links:0,-1,0,0 /dev/nvidia1
GRES[gpu] Type:tesla_k20m Count:1 Cores(32):0-15  Links:0,0,-1,0 /dev/nvidia2
GRES[gpu] Type:tesla_k20m Count:1 Cores(32):0-15  Links:0,0,0,-1 /dev/nvidia3

# n78: somewhat dated gpu model, bound to top/bottom halves of the physical cores (16)
GRES[gpu] Type:geforce_gtx_1080_ti Count:1 Cores(32):0-7   Links:-1,0,0,0 /dev/nvidia0
GRES[gpu] Type:geforce_gtx_1080_ti Count:1 Cores(32):0-7   Links:0,-1,0,0 /dev/nvidia1
GRES[gpu] Type:geforce_gtx_1080_ti Count:1 Cores(32):8-15  Links:0,0,-1,0 /dev/nvidia2
GRES[gpu] Type:geforce_gtx_1080_ti Count:1 Cores(32):8-15  Links:0,0,0,-1 /dev/nvidia3

# n79: more recent gpu model, same top/bottom binding pattern (24)
GRES[gpu] Type:geforce_rtx_2080_s Count:1 Cores(48):0-11   Links:-1,0,0,0 /dev/nvidia0
GRES[gpu] Type:geforce_rtx_2080_s Count:1 Cores(48):0-11   Links:0,-1,0,0 /dev/nvidia1
GRES[gpu] Type:geforce_rtx_2080_s Count:1 Cores(48):12-23  Links:0,0,-1,0 /dev/nvidia2
GRES[gpu] Type:geforce_rtx_2080_s Count:1 Cores(48):12-23  Links:0,0,0,-1 /dev/nvidia3

</code>

** Partition Priority **

If partition priorities are set, you can list more than one queue...

<code>
 srun --partition=exx96,amber128,mwgpu  --mem=1024  --gpus=1  --gres=gpu:any sleep 60 &
</code>

The above will fill up n79 first, then n78, then n36...
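
That fill order comes from the partitions' priority settings; a sketch of what that could look like in slurm.conf (the tier values are illustrative, not the production numbers, and the remaining partition parameters are omitted):

<code>
# when a job lists several partitions, higher PriorityTier partitions are considered first
PartitionName=exx96    PriorityTier=30 ...
PartitionName=amber128 PriorityTier=20 ...
PartitionName=mwgpu    PriorityTier=10 ...
</code>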

** Node Weight Priority **

Weight nodes by the memory per logical core: jobs will be allocated the nodes with the lowest weight that satisfies their requirements. So CPU jobs will be routed last to the gpu queues because those have the highest weight (= lowest priority).
<code>
hp12:     12/8  = 1.5
tinymem:  32/20 = 1.6
mw128:   128/24 = 5.333333
mw256:   256/16 = 16

exx96:    96/24 = 4
amber128: 128/16 = 8
mwgpu:   256/16 = 16
</code>

Or, more arbitrarily, weights based on the desired node consumption by cpu jobs. No action taken.

<code>
tinymem    10
mw128      20
mw256fd    30    HasMem256 feature so cpu jobs can directly target large mem
mwgpu      40    also HasMem256 feature
amber128   50
exx96      80
</code>
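
Either scheme ends up as a ''Weight'' value on the node definitions, something like the sketch below (using the arbitrary numbers above; node lists are abbreviated to the nodes named on this page and the remaining node parameters are omitted):

<code>
# the lowest-weight node that satisfies the request is allocated first,
# so the gpu nodes (highest weights) receive plain cpu jobs last
NodeName=n[33-37] Weight=40 ...   # mwgpu, also tagged with the HasMem256 feature
NodeName=n78      Weight=50 ...   # amber128
NodeName=n79      Weight=80 ...   # exx96
</code>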

** CR_CPU_Memory **

Makes for a better 1-to-1 relationship of physical core to ''ntask'', yet the "hyperthreads" are still available to user jobs while physical cores are consumed first, if I got all this right.

Deployed. Might need to set threads=1 and cpus=(quantity of physical cores)... this went horribly wrong; it resulted in a sockets=1 setting and threads=1 for each node.
 --- //[[hmeij@wesleyan.edu|Henk]] 2021/10/18 14:32//

We did set the number of cpus per gpu (12 for n79) and minimum memory settings. Now we observe the 5th job pending with 48 cpus consumed. When using sbatch, set -n 8 because sbatch will override the defaults.

<code>
 srun --partition=test  --mem=1024  --gres=gpu:geforce_rtx_2080_s:1 sleep 60 &
</code>
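
A batch-script equivalent of that srun line, per the note above that sbatch overrides the defaults and therefore needs the task count stated explicitly (a sketch, not a production script):

<code>
#!/bin/bash
#SBATCH --partition=test
#SBATCH --mem=1024
#SBATCH --gres=gpu:geforce_rtx_2080_s:1
#SBATCH -n 8        # stated explicitly because sbatch overrides the DefCpuPerGPU default

sleep 60
</code>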
  
\\
**[[cluster:0|Back]]**
  