There is a techie page at this location **[[cluster:
__This page is intended for users__ to get started with the Slurm scheduler. ''
** Default Environment **
$ scontrol show node n78
NodeName=n78 Arch=x86_64 CoresPerSocket=8

# sorta like bhist -l
Same on the cpu only compute nodes. Features could be created for memory footprints (for example "
On the cpu resource requests: You may request 1 or more nodes, 1 or more sockets per node, 1 or more cores (physical) per socket, or 1 or more threads (logical + physical) per core. Such a request can be fine-grained or not; just request a node with ''
//Note: this oversubscribing is not working yet. I can only get 4 simultaneous jobs running. Maybe there is a conflict with Openlava jobs. Should isolate a node and do further testing. After isolation (n37), 4 jobs with -n 4 exhaust the number of physical cores. Is that why the 5th job goes pending? Solved, see the Changes section.//
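
For illustration, a minimal sketch of such a fine-grained request (the partition, memory value, and ''sleep'' payload are placeholders, modeled on the oversubscribe example further down this page):

<code>
# one node, 1 socket, 4 physical cores, 1 thread per core
$ srun --partition=mwgpu -N 1 -n 4 -B 1:4:1 --mem=1024 sleep 60 &
</code>
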
===== MPI =====
Slurm has a builtin MPI flavor; I suggest you do not rely on it. The documentation states that on major release upgrades the ''
For now, we'll rely on PATH/
''
<code>
$ srun --partition=mwgpu -n 4 -B 1:4:1 --mem=1024 sleep 60 &
</code>
===== Feedback =====
If there are errors on this page, or misstatements, 
--- //
===== gpu testing =====

  * test standalone slurm v 21.08.1
  * n33-n37 each: 4 gpus, 16 cores, 16 threads, 32 cpus
  * submit one at a time, observe (see the sketch after this list)
  * part=test, n 1, B 1:1:1, cuda_visible=0, 
  * "
  * all on same gpu
  * part=test, n 1, B 1:1:1, cuda_visible not set, no node specified, n33 only
  * "
  * all gpus used? nope, all on the same one (gpu 0)
  * redoing above with a ''
  * even distribution across all gpus, 17th submit reason too
  * part=test, n 1, B 1:1:1, cuda_visible not set, no node specified, n[33-34] avail
  * while submitting 34 jobs, one at a time (30s delay), slurm fills up n33 first (all on gpu 0)
  * 17th submit goes to n34, gpu 1 (weird), n33 state=alloc, 
  * 33rd job, "
  * 34th job, "
  * all n33 and n34 jobs on single gpu without cuda_visible set
  * how that works with gpu util at 100% with one job is beyond me
  * do all 16 jobs log the same wall time? Yes, between 10.10 and 10.70 hours.
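
For reference, a minimal sketch of the kind of submission used in these tests (the payload ''run_gpu_job'' is a hypothetical stand-in for the actual GPU binary):

<code>
# pin the job to gpu 0 explicitly; one node, one task, one core
export CUDA_VISIBLE_DEVICES=0
srun --partition=test -N 1 -n 1 -B 1:1:1 ./run_gpu_job &
</code>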
+ | |||
+ | * ohpc v2.4 slurm v 20.11.8 | ||
+ | * part=test, n 1, B 1:1:1, cuda_visible=0, | ||
+ | * hit a bug, you must specify cpus-per-gpu **and** mem-per-gpu | ||
+ | * then slurm detects 4 gpus on allocated node and allows 4 jobs on a single allocated gpu | ||
+ | * twisted logic | ||
+ | * so recent openhpc version but old slurm version in software stack | ||
+ | * trying standalone install on openhpc prod cluster - auth/munge error, no go | ||
+ | * do all 4 jobs have similar wall time? Yes on n100 varies from 0.6 to 0.7 hours | ||
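
A hedged sketch of the submission shape that works around the bug noted above (the gres request ''gpu:1'' and the payload are assumptions; the cpu and memory values follow the notes elsewhere on this page):

<code>
# both --cpus-per-gpu and --mem-per-gpu must be given explicitly
srun --partition=test -N 1 -n 1 -B 1:1:1 --gres=gpu:1 \
     --cpus-per-gpu=1 --mem-per-gpu=7168 ./run_gpu_job &
</code>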
+ | |||
+ | * ohpc v2.4 slurm v 20.11.8 | ||
+ | * part=test, n 1, B 1:1:1, cuda_visible=0, | ||
+ | * same as above but all 16 jobs run on gpu 0 | ||
+ | * so the limit to 4 jobs on rtx5000 gpu is a hardware phenomenon? | ||
+ | * all 16 jobs finished, waal times of 3.11 to 3.60 hours | ||
+ | |||
+ | ===== gpu testing 2 ===== | ||
+ | |||
+ | Newer 2022 version seems to have reversed the override options for oversubscribe. So here is our testing...back to CR_CPU_Memory and OverSubscribe=No | ||
+ | |||
<code>

CR_Socket_Memory
PartitionName=test Nodes=n[100-101] 
Default=YES MaxTime=INFINITE State=UP 
OverSubscribe=No DefCpuPerGPU=12 

MPI jobs with -N 1, -n 8 and -B 2:4:1
no override options, cpus=48
--mem=2048, cpus=48
and --cpus-per-task=1, 
and --ntasks-per-node=8, 

MPI jobs with -N, -n 8 and -B 1:8:1
--mem=10240, cpus=48
and --cpus-per-task=1, 
and --ntasks-per-node=8, 

GPU jobs with -N 1, -n 1 and -B 1:1:1
no override options, no cuda export, cpus=48
--cpus-per-gpu=1, 
and --mem-per-gpu=7168, 
while other gpu runs in queue but gpus are free???)

GPU jobs with -N 1, -n 1 and -B 1:1:1
no override options, yes cuda export, cpus=48
--cpus-per-gpu=1, 
and --mem-per-gpu=7168, 
while a gpu job runs, gpus are free, then it executes)

...suddenly the cpus=1 turns into cpus=24
when submitting, slurm confused because of all
the job cancellations?

CR_CPU_Memory test=no, mwgpu=force:
PartitionName=test Nodes=n[100-101] 
Default=YES MaxTime=INFINITE State=UP 
OverSubscribe=No DefCpuPerGPU=12 

MPI jobs with -N 1, -n 8 and -B 2:4:1
no override options, cpus=8 (queue fills across nodes,
but only one job per node, test & mwgpu)
--mem=1024, cpus=8 (queue fills first node ...,
but only three jobs per node, test 3x8=24 full 4th job pending &
mwgpu 17th job goes pending on n33, overloaded with -n 8 !!)
(not needed) --cpus-per-task=?, 
(not needed) 


GPU jobs with -N 1, -n 1 and -B 1:1:1 on test
no override options, no cuda export, cpus=12 (one gpu per node)
--cpus-per-gpu=1, 
and --mem-per-gpu=7168, 
required else all mem allocated!, max 4 jobs per node,
fills first node first...cuda export not needed)
with cuda export, same node, same gpu, 
with "


GPU jobs with -N 1, -n 1 and -B 1:1:1 on mwgpu
--cpus-per-gpu=1, 
and --mem-per-gpu=7168, 
(same node, same gpu, cuda export set,
with "
potential for overloading!)

</code>
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
===== Changes =====

** OverSubscribe **

Suggestion was made to set ''

''

<code>
#!/bin/bash
#SBATCH --job-name=sleep
#SBATCH --partition=mwgpu
###SBATCH -n 1
#SBATCH -B 1:1:1
#SBATCH --mem=1024
sleep 60
</code>

--- //

** GPU-CPU cores **

Noticed this with debug level on in slurmd.log. No action taken.

<code>

# n37: old gpu model bound to all physical cpu cores
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:

# n78: somewhat dated gpu model, bound to top/bot of physical cores (16)
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:

# n79, more recent gpu model, same bound pattern of top/bot (24)
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:
GRES[gpu] Type:

</code>
+ | |||
+ | ** Partition Priority ** | ||
+ | |||
+ | If set you can list more than one queue... | ||
+ | |||
+ | < | ||
+ | srun --partition=exx96, | ||
+ | </ | ||
+ | |||
+ | The above will fill up n79 first, then n78, then n36... | ||
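
For illustration, a hedged, fully spelled-out version of such a multi-partition submission (the payload and memory value are placeholders; the queues listed are the gpu partitions named elsewhere on this page, and their configured priorities determine which node fills first):

<code>
srun --partition=exx96,amber128,mwgpu -n 1 --mem=1024 sleep 60 &
</code>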
+ | |||
+ | ** Node Weight Priority ** | ||
+ | |||
+ | Weight nodes by the memory per logical core: jobs will be allocated the nodes with the lowest weight which satisfies their requirements. So CPU jobs will be routed last to gpu queues because they have the highest weight (=lowest priority). | ||
+ | < | ||
+ | hp12: 12/8 = 1.5 | ||
+ | tinymem: 32/20 = 1.6 | ||
+ | mw128: 128/24 = 5.333333 | ||
+ | mw256: 256/16 = 16 | ||
+ | |||
+ | exx96: 96/24 = 4 | ||
+ | amber128: 128/16 = 8 | ||
+ | mwgpu = 256/16 = 16 | ||
+ | </ | ||
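
A hedged sketch of how such weights might appear in ''slurm.conf'' (node ranges, cpu counts, and memory values below are illustrative; ''Weight'' is the standard node parameter, and lower weights are allocated first):

<code>
# cpu nodes get the lowest weights, gpu nodes the highest
NodeName=n[1-8]   CPUs=8  RealMemory=12000  Weight=2  State=UNKNOWN   # hp12-like
NodeName=n[33-37] CPUs=32 RealMemory=256000 Weight=16 State=UNKNOWN   # mwgpu-like
</code>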
+ | |||
+ | Or more arbitrary (based on desired cpu node comsumption of cpu jobs. No action taken. | ||
+ | |||
+ | < | ||
+ | tinymem | ||
+ | mw128 20 | ||
+ | mw256fd | ||
+ | mwgpu 40 + HasMem256 feature | ||
+ | amber128 | ||
+ | exx96 80 | ||
+ | </ | ||
+ | |||
+ | ** CR_CPU_Memory ** | ||
+ | |||
+ | Makes for a better 1-1 relationship of physical core to '' | ||
+ | |||
+ | Deployed. My need to set threads=1 and cpus=(quantity of physical cores)...this went horribly wrong it resaulted in sockets=1 setting and threads=1 for each node. | ||
+ | --- // | ||
+ | |||
+ | We did set number of cpus per gpu (12 for n79) and minimum memory settings. Now we experience 5th job pending with 48 cpus consumed. When using sbatch set -n 8 because sbatch will override defaults. | ||
+ | |||
+ | < | ||
+ | srun --partition=test | ||
+ | </ | ||
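
For example, a hedged sbatch sketch of the same submission (the job name and payload are placeholders; -n 8 is set explicitly because sbatch overrides the partition defaults):

<code>
#!/bin/bash
#SBATCH --job-name=mpi8
#SBATCH --partition=test
#SBATCH -N 1
#SBATCH -n 8
#SBATCH --mem=1024
srun ./my_mpi_program   # hypothetical MPI binary
</code>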
\\
**[[cluster: