There is a techie page at this location **[[cluster:
__This page is intended for users__ to get started with the Slurm scheduler. ''
** Default Environment **
$ scontrol show node n78
NodeName=n78 Arch=x86_64 CoresPerSocket=8

# sorta like bhist -l
Same on the cpu only compute nodes. Features could be created for memory footprints (for example "
On the cpu resource requests: You may request 1 or more nodes, 1 or more sockets per node, 1 or more cores (physical) per socket, or 1 or more threads (logical + physical) per core. Such a request can be fine-grained or not; just request a node with ''
//Note: this oversubscribing is not working yet. I can only get 4 simultaneous jobs running. Maybe there is a conflict with Openlava jobs. Should isolate a node and do further testing. After isolation (n37), 4 jobs with -n 4 exhaust the number of physical cores. Is that why the 5th job goes pending? Solved, see the Changes section.//
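
For illustration, a minimal sketch of such a fine-grained request (the partition, memory value, and ''sleep'' payload are placeholders, modeled on the oversubscribe example further down this page):

<code>
# one node, 1 socket, 4 physical cores, 1 thread per core
$ srun --partition=mwgpu -N 1 -n 4 -B 1:4:1 --mem=1024 sleep 60 &
</code>
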
===== MPI =====
Slurm has a builtin MPI flavor; I suggest you do not rely on it. The documentation states that on major release upgrades the ''
For now, we'll rely on PATH/
''
<code>
$ srun --partition=mwgpu -n 4 -B 1:4:1 --mem=1024 sleep 60 &
</code>
===== Feedback =====
If there are errors on this page, or misstatements, 
--- //
===== gpu testing =====

  * test standalone slurm v 21.08.1
  * n33-n37 each: 4 gpus, 16 cores, 16 threads, 32 cpus
  * submit one at a time, observe (see the sketch after this list)
  * part=test, n 1, B 1:1:1, cuda_visible=0, 
  * "
  * all on same gpu
  * part=test, n 1, B 1:1:1, cuda_visible not set, no node specified, n33 only
  * "
  * all gpus used? nope, all on the same one (gpu 0)
  * redoing above with a ''
  * even distribution across all gpus, 17th submit reason too
  * part=test, n 1, B 1:1:1, cuda_visible not set, no node specified, n[33-34] avail
  * while submitting 34 jobs, one at a time (30s delay), slurm fills up n33 first (all on gpu 0)
  * 17th submit goes to n34, gpu 1 (weird), n33 state=alloc, 
  * 33rd job, "
  * 34th job, "
  * all n33 and n34 jobs on single gpu without cuda_visible set
  * how that works with gpu util at 100% with one job is beyond me
  * do all 16 jobs log the same wall time? Yes, between 10.10 and 10.70 hours.
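
For reference, a minimal sketch of the kind of submission used in these tests (the payload ''run_gpu_job'' is a hypothetical stand-in for the actual GPU binary):

<code>
# pin the job to gpu 0 explicitly; one node, one task, one core
export CUDA_VISIBLE_DEVICES=0
srun --partition=test -N 1 -n 1 -B 1:1:1 ./run_gpu_job &
</code>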
+ | |||
+ | * ohpc v2.4 slurm v 20.11.8 | ||
+ | * part=test, n 1, B 1:1:1, cuda_visible=0, | ||
+ | * hit a bug, you must specify cpus-per-gpu **and** mem-per-gpu | ||
+ | * then slurm detects 4 gpus on allocated node and allows 4 jobs on a single allocated gpu | ||
+ | * twisted logic | ||
+ | * so recent openhpc version but old slurm version in software stack | ||
+ | * trying standalone install on openhpc prod cluster - auth/munge error, no go | ||
+ | * do all 4 jobs have similar wall time? Yes on n100 varies from 0.6 to 0.7 hours | ||
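
A hedged sketch of the submission shape that works around the bug noted above (the gres request ''gpu:1'' and the payload are assumptions; the cpu and memory values follow the notes elsewhere on this page):

<code>
# both --cpus-per-gpu and --mem-per-gpu must be given explicitly
srun --partition=test -N 1 -n 1 -B 1:1:1 --gres=gpu:1 \
     --cpus-per-gpu=1 --mem-per-gpu=7168 ./run_gpu_job &
</code>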
+ | |||
+ | * ohpc v2.4 slurm v 20.11.8 | ||
+ | * part=test, n 1, B 1:1:1, cuda_visible=0, | ||
+ | * same as above but all 16 jobs run on gpu 0 | ||
+ | * so the limit to 4 jobs on rtx5000 gpu is a hardware phenomenon? | ||
+ | * all 16 jobs finished, waal times of 3.11 to 3.60 hours | ||
+ | |||
+ | ===== gpu testing 2 ===== | ||
+ | |||
+ | Newer 2022 version seems to have reversed the override options for oversubscribe. So here is our testing...back to CR_CPU_Memory and OverSubscribe=No | ||
+ | |||
<code>

CR_Socket_Memory
PartitionName=test Nodes=n[100-101] 
Default=YES MaxTime=INFINITE State=UP 
OverSubscribe=No DefCpuPerGPU=12 

MPI jobs with -N 1, -n 8 and -B 2:4:1
no override options, cpus=48
--mem=2048, cpus=48
and --cpus-per-task=1, 
and --ntasks-per-node=8, 

MPI jobs with -N, -n 8 and -B 1:8:1
--mem=10240, cpus=48
and --cpus-per-task=1, 
and --ntasks-per-node=8, 

GPU jobs with -N 1, -n 1 and -B 1:1:1
no override options, no cuda export, cpus=48
--cpus-per-gpu=1, 
and --mem-per-gpu=7168, 
while other gpu runs in queue but gpus are free???)

GPU jobs with -N 1, -n 1 and -B 1:1:1
no override options, yes cuda export, cpus=48
--cpus-per-gpu=1, 
and --mem-per-gpu=7168, 
while a gpu job runs, gpus are free, then it executes)

...suddenly the cpus=1 turns into cpus=24
when submitting, slurm confused because of all
the job cancellations?

CR_CPU_Memory test=no, mwgpu=force:
PartitionName=test Nodes=n[100-101] 
Default=YES MaxTime=INFINITE State=UP 
OverSubscribe=No DefCpuPerGPU=12 

MPI jobs with -N 1, -n 8 and -B 2:4:1
no override options, cpus=8 (queue fills across nodes,
but only one job per node, test & mwgpu)
--mem=1024, cpus=8 (queue fills first node ...,
but only three jobs per node, test 3x8=24 full 4th job pending &
mwgpu 17th job goes pending on n33, overloaded with -n 8 !!)
(not needed) --cpus-per-task=?, 
(not needed) 


GPU jobs with -N 1, -n 1 and -B 1:1:1 on test
no override options, no cuda export, cpus=12 (one gpu per node)
--cpus-per-gpu=1, 
and --mem-per-gpu=7168, 
required else all mem allocated!, max 4 jobs per node,
fills first node first...cuda export not needed)
with cuda export, same node, same gpu, 
with "


GPU jobs with -N 1, -n 1 and -B 1:1:1 on mwgpu
--cpus-per-gpu=1, 
and --mem-per-gpu=7168, 
(same node, same gpu, cuda export set,
with "
potential for overloading!)

</code>
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
===== Changes =====

** OverSubscribe **

Suggestion was made to set ''

''

<code>
#!/bin/bash
#SBATCH --job-name=sleep
#SBATCH --partition=mwgpu
###SBATCH -n 1
#SBATCH -B 1:1:1
#SBATCH --mem=1024
sleep 60
</code>

--- //

** GPU-CPU cores **

Noticed this with debug level on in slurmd.log. No action taken.

<code>

# n37: old gpu model bound to all physical cpu cores
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:

# n78: somewhat dated gpu model, bound to top/bot of physical cores (16)
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:

# n79, more recent gpu model, same bound pattern of top/bot (24)
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:

</code>
+ | |||
+ | ** Partition Priority ** | ||
+ | |||
+ | If set you can list more than one queue... | ||
+ | |||
+ | < | ||
+ | srun --partition=exx96, | ||
+ | </ | ||
+ | |||
+ | The above will fill up n79 first, then n78, then n36... | ||
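
For illustration, a hedged, fully spelled-out version of such a multi-partition submission (the payload and memory value are placeholders; the queues listed are the gpu partitions named elsewhere on this page, and their configured priorities determine which node fills first):

<code>
srun --partition=exx96,amber128,mwgpu -n 1 --mem=1024 sleep 60 &
</code>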
+ | |||
+ | ** Node Weight Priority ** | ||
+ | |||
+ | Weight nodes by the memory per logical core: jobs will be allocated the nodes with the lowest weight which satisfies their requirements. So CPU jobs will be routed last to gpu queues because they have the highest weight (=lowest priority). | ||
+ | < | ||
+ | hp12: 12/8 = 1.5 | ||
+ | tinymem: 32/20 = 1.6 | ||
+ | mw128: 128/24 = 5.333333 | ||
+ | mw256: 256/16 = 16 | ||
+ | |||
+ | exx96: 96/24 = 4 | ||
+ | amber128: 128/16 = 8 | ||
+ | mwgpu = 256/16 = 16 | ||
+ | </ | ||
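
A hedged sketch of how such weights might appear in ''slurm.conf'' (node ranges, cpu counts, and memory values below are illustrative; ''Weight'' is the standard node parameter, and lower weights are allocated first):

<code>
# cpu nodes get the lowest weights, gpu nodes the highest
NodeName=n[1-8]   CPUs=8  RealMemory=12000  Weight=2  State=UNKNOWN   # hp12-like
NodeName=n[33-37] CPUs=32 RealMemory=256000 Weight=16 State=UNKNOWN   # mwgpu-like
</code>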
+ | |||
+ | Or more arbitrary (based on desired cpu node comsumption of cpu jobs. No action taken. | ||
+ | |||
+ | < | ||
+ | tinymem | ||
+ | mw128 20 | ||
+ | mw256fd | ||
+ | mwgpu 40 + HasMem256 feature | ||
+ | amber128 | ||
+ | exx96 80 | ||
+ | </ | ||
+ | |||
+ | ** CR_CPU_Memory ** | ||
+ | |||
+ | Makes for a better 1-1 relationship of physical core to '' | ||
+ | |||
+ | Deployed. My need to set threads=1 and cpus=(quantity of physical cores)...this went horribly wrong it resaulted in sockets=1 setting and threads=1 for each node. | ||
+ | --- // | ||
+ | |||
+ | We did set number of cpus per gpu (12 for n79) and minimum memory settings. Now we experience 5th job pending with 48 cpus consumed. When using sbatch set -n 8 because sbatch will override defaults. | ||
+ | |||
+ | < | ||
+ | srun --partition=test | ||
+ | </ | ||
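
For example, a hedged sbatch sketch of the same submission (the job name and payload are placeholders; -n 8 is set explicitly because sbatch overrides the partition defaults):

<code>
#!/bin/bash
#SBATCH --job-name=mpi8
#SBATCH --partition=test
#SBATCH -N 1
#SBATCH -n 8
#SBATCH --mem=1024
srun ./my_mpi_program   # hypothetical MPI binary
</code>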
\\
**[[cluster: