On the cpu resource requests: You may request 1 or more nodes, 1 or more sockets per node, 1 or more cores (physical) per socket, or 1 or more threads (logical + physical) per core. Such a request can be fine grained or not; just request a node with ''
//Note: this oversubscribing is not working yet. I can only get 4 simultaneous jobs running. Maybe there is a conflict with Openlava jobs; should isolate a node and do further testing. After isolation (n37), 4 jobs with -n 4 exhaust the number of physical cores. Is that why the 5th job goes pending?//
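For illustration, here is a hedged sketch of a coarse versus a fine grained cpu request. The flags are standard srun options; the particular values are made up and not taken from this page.

<code>
# coarse request: just ask for 4 tasks, let slurm pick the layout
srun -n 4 --mem=1024 sleep 60

# fine grained request: 1 node, 2 sockets, 4 cores per socket, 1 thread per core
srun -N 1 -B 2:4:1 --mem=1024 sleep 60
</code>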
===== MPI =====
--- //
===== gpu testing =====

  * n33-n37 each: 4 gpus, 16 cores, 16 threads, 32 cpus
  * submit one at a time, observe (a hedged submit sketch follows this list)
  * part=test, n 1, B 1:1:1, cuda_visible=0,
    * "
    * all on same gpu
  * part=test, n 1, B 1:1:1, cuda_visible not set, no node specified, n33 only
    * "
    * all gpus used? nope, all on the same one 0
  * redoing above with a ''
    * even distribution across all gpus, 17th submit reason too
  * part=test, n 1, B 1:1:1, cuda_visible not set, no node specified, n[33-34] avail
    * while submitting 34 jobs, one at a time (30s delay), slurm fills up n33 first (all on gpu 0)
    * 17th submit goes to n34, gpu 1 (weird)
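As referenced in the list above, a hedged sketch of how the one-at-a-time gpu submits might look. The partition name ''test'' and the 1:1:1 layout come from the notes; the script name, the gres flag, and the 30 second delay loop are assumptions.

<code>
# hypothetical submit loop: one gpu job every 30 seconds
for i in $(seq 1 34); do
  sbatch --partition=test -n 1 -B 1:1:1 --gres=gpu:1 gpu_job.sh
  sleep 30
done
# inspect placement: job id and node list, then per-node gpu usage
squeue -p test -o "%i %N"
ssh n33 nvidia-smi
</code>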
===== Changes =====
** OverSubscribe **

Suggestion was made to set ''OverSubscribe=NO'' for all partitions (thanks, Colin). We now observe with a simple sleep script that we can run 16 jobs simultaneously (with either -n or -B). So that's 16 physical cores; each has a logical core, for a total of 32.
<code>
#SBATCH --mem=1024
sleep 60
</code>
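Only a fragment of the sleep script survives above. A minimal complete version might look like this; the script name, job name, and partition are assumptions.

<code>
#!/bin/bash
# sleep.sh -- hypothetical reconstruction of the simple sleep script
#SBATCH --job-name=sleeptest
#SBATCH --partition=test
#SBATCH --mem=1024
sleep 60
</code>

Per the observation above, submitting copies with ''sbatch -n 1 sleep.sh'' (or with ''-B 1:1:1'') should cap at 16 running jobs per node once ''OverSubscribe=NO'' is in effect.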

--- //

** GPU-CPU cores **

Noticed this with debug level on in slurmd.log. No action taken.

<code>
# n37: old gpu model, bound to all physical cpu cores
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:

# n78: somewhat dated gpu model, bound to top/bot of physical cores (16)
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:

# n79: more recent gpu model, same bound pattern of top/bot (24)
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:
</code>
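The core bindings reported above normally come from ''gres.conf''. A hedged sketch of what such entries could look like; the gpu type names, device files, and core ranges below are assumptions, not values copied from these nodes.

<code>
# hypothetical gres.conf fragments illustrating gpu-to-core binding
# n37 style: every gpu bound to all physical cores
NodeName=n37 Name=gpu Type=k20 File=/dev/nvidia[0-3] Cores=0-15
# n78/n79 style: gpus split across the two sockets (top/bot of the cores)
NodeName=n78 Name=gpu Type=rtx File=/dev/nvidia[0-1] Cores=0-7
NodeName=n78 Name=gpu Type=rtx File=/dev/nvidia[2-3] Cores=8-15
</code>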

** Partition Priority **

If priorities are set you can list more than one queue (partition) in a single request...

<code>
srun --partition=exx96,
</code>

The above will fill up n79 first, then n78, then n36...
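A hedged sketch of how that ordering could be expressed with the ''PriorityTier'' partition parameter in slurm.conf (a higher tier is preferred). The tier values and the node-to-partition mapping are assumptions.

<code>
# hypothetical slurm.conf fragments: higher PriorityTier is scheduled first
PartitionName=exx96    Nodes=n79      PriorityTier=30
PartitionName=amber128 Nodes=n78      PriorityTier=20
PartitionName=mwgpu    Nodes=n[33-37] PriorityTier=10
</code>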

** Node Weight Priority **

Weight nodes by the memory per logical core: jobs will be allocated the lowest-weight nodes that satisfy their requirements. So cpu jobs will be routed last to the gpu queues because those have the highest weight (= lowest priority).
<code>
hp12:      12/8  = 1.5
tinymem:   32/20 = 1.6
mw128:    128/24 = 5.333333
mw256:    256/16 = 16

exx96:     96/24 = 4
amber128: 128/16 = 8
mwgpu:    256/16 = 16
</code>

Or more arbitrary (based on the desired cpu node consumption by cpu jobs). No action taken.

<code>
tinymem
mw128     20
mw256fd
mwgpu     40 + HasMem256 feature
amber128
exx96     80
</code>
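Node weights themselves are integers set per node in slurm.conf, and lower-weight nodes are allocated first. A hedged sketch using the memory-per-core ratios above, scaled by 10 to integers; node lists other than n[33-37] and n79 are placeholders.

<code>
# hypothetical slurm.conf fragments: lower Weight is allocated first
NodeName=hp12nodes    Weight=15    # hp12:    12/8  = 1.5
NodeName=tinymemnodes Weight=16    # tinymem: 32/20 = 1.6
NodeName=n79          Weight=40    # exx96:   96/24 = 4
NodeName=n[33-37]     Weight=160   # mwgpu:  256/16 = 16
</code>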

** CR_CPU_Memory **

Makes for a better 1-1 relationship of physical core to ''

Deployed. May need to set threads=1 and cpus=(quantity of physical cores)... this went horribly wrong; it resulted in a sockets=1 setting and threads=1 for each node.
--- //

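For context, ''CR_CPU_Memory'' is a ''SelectTypeParameters'' value in slurm.conf that makes cpus and memory the consumable resources. A hedged sketch of the idea described above (not necessarily what was deployed); the node geometry and memory values are assumptions.

<code>
# hypothetical slurm.conf fragments for CR_CPU_Memory
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
# advertise only physical cores as cpus (ThreadsPerCore=1), i.e. the
# "threads=1, cpus = number of physical cores" idea mentioned above
NodeName=n[33-37] Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=256000 Gres=gpu:4
</code>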
We did set the number of cpus per gpu (12 for n79) and minimum memory settings. Now we observe the 5th job pending with 48 cpus consumed. When using sbatch, set -n 8 because sbatch will override the defaults.

<code>
srun --partition=test
</code>
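The per-gpu cpu count and minimum memory defaults mentioned above can be expressed as partition defaults. A hedged sketch; only the 12 cpus per gpu for n79 comes from the note, the rest is assumed.

<code>
# hypothetical slurm.conf partition defaults
PartitionName=exx96 Nodes=n79 DefCpuPerGPU=12 DefMemPerCPU=1024
</code>

With 12 cpus per gpu, four such jobs consume 48 cpus, which matches the 5th-job-pending behavior described above.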