cluster:208
Last modified: 2022/11/02 17:28 by hmeij07

===== gpu testing =====

  * n33 only, 4 gpus, 16 cores, 16 threads, 32 cpus
  * submit one at a time, observe
  * part=test, n 1, B 1:1:1, cuda_visible=0
    * all on same gpu
  * part=test, n 1, B 1:1:1, cuda_visible not set, no node specified, n33 only
    * all gpus used? nope, all on the same one, gpu 0
    * even distribution across all gpus, 17th submit reason too
  * part=test, n 1, B 1:1:1, cuda_visible not set, no node specified, n[33-34] avail
    * while submitting 34 jobs, one at a time (30s delay), slurm fills up n33 first (all on gpu 0)
    * 17th submit goes to n34, gpu 1 (weird), n33 state=alloc
    * 33rd job ...
    * 34th job ...
    * all n33 and n34 jobs on a single gpu without cuda_visible set
    * how that works with gpu util at 100% with one job is beyond me
    * do all 16 jobs log the same wall time? Yes, between 10.10 and 10.70 hours.
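
The submit pattern tested above can be sketched as a minimal Slurm batch script. The partition, task count, and `-B` binding come from the notes; the job payload name is hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch of the single-GPU submit described above:
# partition "test", one node, one task, binding -B 1:1:1.
#SBATCH --partition=test
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -B 1:1:1

# The "cuda_visible=0" case: with this export set, every job lands on
# gpu 0 ("all on same gpu"); comment it out to test Slurm's own placement.
export CUDA_VISIBLE_DEVICES=0

./my_gpu_app   # placeholder for the actual workload
```

Submitting copies of this one at a time (e.g. with a 30-second pause between ''sbatch'' calls) reproduces the fill pattern observed in the notes.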
| + | |||
| + | * ohpc v2.4 slurm v 20.11.8 | ||
| + | * part=test, n 1, B 1:1:1, cuda_visible=0, | ||
| + | * hit a bug, you must specify cpus-per-gpu **and** mem-per-gpu | ||
| + | * then slurm detects 4 gpus on allocated node and allows 4 jobs on a single allocated gpu | ||
| + | * twisted logic | ||
| + | * so recent openhpc version but old slurm version in software stack | ||
| + | * trying standalone install on openhpc prod cluster - auth/munge error, no go | ||
| + | * do all 4 jobs have similar wall time? Yes on n100 varies from 0.6 to 0.7 hours | ||
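
The workaround for that bug can be sketched as follows; the per-GPU values and the GPU request line are illustrative assumptions, not values from the notes.

```shell
#!/bin/bash
# Sketch of the working form under slurm 20.11.8: per the notes, both
# --cpus-per-gpu AND --mem-per-gpu must be given, or allocation misbehaves.
# The gres request and the specific values here are assumptions.
#SBATCH --partition=test
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -B 1:1:1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-gpu=1
#SBATCH --mem-per-gpu=7168

export CUDA_VISIBLE_DEVICES=0   # the cuda_visible=0 case above

./my_gpu_app   # placeholder for the actual workload
```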
| + | |||
  * ohpc v2.4, slurm v20.11.8
  * part=test, n 1, B 1:1:1, cuda_visible=0
    * same as above but all 16 jobs run on gpu 0
    * so the limit of 4 jobs on the rtx5000 gpu is a hardware phenomenon?
    * all 16 jobs finished, wall times of 3.11 to 3.60 hours

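
A quick way to verify which GPU each job actually landed on, as done informally above, is to poll ''nvidia-smi'' on the execution node. This is a generic sketch, not the script used in these tests.

```shell
#!/bin/bash
# Per-GPU utilization and memory: shows whether load sits on gpu 0 only.
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv

# Compute processes currently resident on the GPUs: one line per job
# process, so 16 jobs sharing one device show up as 16 pids.
nvidia-smi --query-compute-apps=pid,used_memory --format=csv
```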
===== gpu testing 2 =====

The newer 2022 version seems to have reversed the override options for oversubscribe. So here is our testing... back to CR_CPU_Memory and OverSubscribe=No.

<code>

CR_Socket_Memory
PartitionName=test Nodes=n[100-101]
Default=YES MaxTime=INFINITE State=UP
OverSubscribe=No DefCpuPerGPU=12

MPI jobs with -N 1, -n 8 and -B 2:4:1
no override options, cpus=48
--mem=2048, cpus=48
and --cpus-per-task=1,
and --ntasks-per-node=8,

MPI jobs with -N 1, -n 8 and -B 1:8:1
--mem=10240, cpus=48
and --cpus-per-task=1,
and --ntasks-per-node=8,

GPU jobs with -N 1, -n 1 and -B 1:1:1
no override options, no cuda export, cpus=48
--cpus-per-gpu=1,
and --mem-per-gpu=7168,
(while other gpu runs in queue but gpus are free???)

GPU jobs with -N 1, -n 1 and -B 1:1:1
no override options, yes cuda export, cpus=48
--cpus-per-gpu=1,
and --mem-per-gpu=7168,
(while a gpu job runs, gpus are free, then it executes)

...suddenly the cpus=1 turns into cpus=24
when submitting; slurm confused because of all
the job cancellations?

CR_CPU_Memory test=no, mwgpu=force:
PartitionName=test Nodes=n[100-101]
Default=YES MaxTime=INFINITE State=UP
OverSubscribe=No DefCpuPerGPU=12

MPI jobs with -N 1, -n 8 and -B 2:4:1
no override options, cpus=8 (queue fills across nodes,
but only one job per node, test & mwgpu)
--mem=1024, cpus=8 (queue fills first node ...,
but only three jobs per node, test 3x8=24 full, 4th job pending &
mwgpu 17th job goes pending on n33, overloaded with -n 8 !!)
(not needed) --cpus-per-task=?,
(not needed)

GPU jobs with -N 1, -n 1 and -B 1:1:1 on test
no override options, no cuda export, cpus=12 (one gpu per node)
--cpus-per-gpu=1,
and --mem-per-gpu=7168,
(required else all mem allocated!, max 4 jobs per node,
fills first node first... cuda export not needed)
with cuda export, same node, same gpu

GPU jobs with -N 1, -n 1 and -B 1:1:1 on mwgpu
--cpus-per-gpu=1,
and --mem-per-gpu=7168,
(same node, same gpu, cuda export set,
potential for overloading!)
| + | </ | ||
| + | |||
| + | |||
| + | |||
| + | |||
| + | |||
===== Changes =====