cluster:208 [2021/10/15 12:57] hmeij07 → cluster:208 [2021/10/18 17:41] hmeij07 [Changes]
Same on the cpu only compute nodes. Features could be created for memory footprints (for example "
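If such memory-footprint features were defined on the nodes, a job could select them with Slurm's constraint flag. A sketch (the feature name ''mem256'' is hypothetical, and this needs a running Slurm cluster, so it is not runnable as-is):

```shell
# Request any node in the partition that advertises a
# hypothetical "mem256" feature in its slurm.conf entry.
srun --partition=mwgpu --constraint=mem256 --mem=1024 sleep 60
```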
On the cpu resource requests: you may request 1 or more nodes, 1 or more sockets per node, 1 or more cores (physical) per socket, or 1 or more threads (logical + physical) per core. Such a request can be fine grained or not; just request a node with ''
//Note: this oversubscribing is not working yet. I can only get 4 simultaneous jobs running. Maybe there is a conflict with Openlava jobs. Should isolate a node and do further testing. After isolation (n37), 4 jobs with -n 4 exhaust the number of physical cores. Is that why the 5th job goes pending?//
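As an illustration of the fine-grained form described above, a batch-script sketch (the counts and the ''mwgpu'' partition are illustrative only, and this requires a Slurm cluster to actually submit):

```shell
#!/bin/bash
# Hypothetical job: 1 node, and via -B (sockets:cores:threads)
# 1 socket, 4 physical cores per socket, 1 thread per core.
#SBATCH --job-name=finegrain
#SBATCH --partition=mwgpu
#SBATCH -N 1
#SBATCH -B 1:4:1
#SBATCH --mem=1024
sleep 60
```

Submit with ''sbatch''; the coarse equivalent is simply ''-n 4'' and letting Slurm place the tasks.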
===== MPI =====
Slurm has a builtin MPI flavor; I suggest you do not rely on it. The documentation states that on major release upgrades the ''
For now, we'll rely on PATH/
''
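A minimal sketch of that approach inside a job script, assuming a hypothetical OpenMPI install prefix (the real path on the cluster will differ):

```shell
# Hypothetical install prefix; adjust to the actual MPI build location.
MPI_HOME=/share/apps/openmpi/4.1.1
export PATH=$MPI_HOME/bin:$PATH
export LD_LIBRARY_PATH=$MPI_HOME/lib:$LD_LIBRARY_PATH
# Verify which directory is searched first for mpirun.
echo "$PATH" | cut -d: -f1
```

Because the MPI ''bin'' directory is prepended, its ''mpirun'' shadows any other copy on the PATH.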
<code>
$ srun --partition=mwgpu -n 4 -B 1:4:1 --mem=1024 sleep 60 &
</code>
===== Feedback =====
If there are errors on this page, or misstatements,
--- //
===== Changes =====

** OverSubscribe **

Suggestion was made to set ''

''

<code>
#!/bin/bash
#SBATCH --job-name=sleep
#SBATCH --partition=mwgpu
###SBATCH -n 1
#SBATCH -B 1:1:1
#SBATCH --mem=1024
sleep 60
</code>

--- //
+ | |||
** GPU-CPU cores **

Noticed this with debug level on in slurmd.log:

<code>
# n37: old gpu model bound to all physical cpu cores
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:

# n78: somewhat dated gpu model, bound to top/bot of physical cores (16)
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:

# n79: more recent gpu model, same bound pattern of top/bot (24)
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:
</code>
+ | |||
** Weight Priority **

Weight nodes by the memory per logical core: jobs will be allocated the nodes with the lowest weight that satisfies their requirements. So cpu jobs will be routed last to the gpu queues, because those have the highest weight (= lowest priority).

<code>
hp12:      12/8  = 1.5
tinymem:   32/20 = 1.6
mw128:    128/24 = 5.333333
mw256:    256/16 = 16

exx96:     96/24 = 4
amber128: 128/16 = 8
mwgpu:    256/16 = 16
</code>
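The ratios above can be recomputed with a quick shell loop (queue names and memory/core counts are taken from the table above):

```shell
# Recompute weight = memory (GB) / logical cores for each queue.
for q in "hp12 12 8" "tinymem 32 20" "mw128 128 24" "mw256 256 16" \
         "exx96 96 24" "amber128 128 16" "mwgpu 256 16"; do
  set -- $q
  awk -v name="$1" -v mem="$2" -v cores="$3" \
      'BEGIN { printf "%-9s %.2f\n", name, mem/cores }'
done
```

Sorting those values ascending gives the allocation order: hp12 and tinymem first, mw256 and mwgpu last.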
\\
**[[cluster: