
Slurm Test Env

For those of you who are interested in the setup, there is a separate Slurm Techie Page.

This page is intended to help users get started with the Slurm scheduler. greentail52 will be the Slurm scheduler test “controller”, with several cpu+gpu compute nodes configured. Any jobs submitted should be simple, quick-running jobs, like “sleep” or “hello world” jobs. The configured compute nodes are still managed by Openlava.

Default Environment

Slurm was compiled within this environment

# installer found /usr/local/cuda symbolic link to n37-cuda-9.2
export CUDAHOME=/usr/local/n37-cuda-9.2
export PATH=/usr/local/n37-cuda-9.2/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/n37-cuda-9.2/lib64:$LD_LIBRARY_PATH
which nvcc

# openmpi, just in case
export PATH=/share/apps/CENTOS7/openmpi/4.0.4/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS7/openmpi/4.0.4/lib:$LD_LIBRARY_PATH
which mpirun

Slurm Location

# for now, symbolic link to slurm-21.08.1
export PATH=/usr/local/slurm/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/slurm/lib:$LD_LIBRARY_PATH
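
If these exports live in a file that gets sourced more than once, PATH keeps growing. The following is a minimal sketch of an idempotent variant, assuming a bash-like shell; prepend_dir is a hypothetical helper name, not something Slurm provides.

```shell
# Prepend a directory to a colon-separated list only if it is not
# already present. prepend_dir is a hypothetical helper (assumption).
prepend_dir() {
    # $1 = current value of the variable, $2 = directory to prepend
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;      # already present, unchanged
        "::")     printf '%s\n' "$2" ;;      # variable was empty
        *)        printf '%s\n' "$2:$1" ;;   # prepend
    esac
}

PATH=$(prepend_dir "$PATH" /usr/local/slurm/bin)
LD_LIBRARY_PATH=$(prepend_dir "${LD_LIBRARY_PATH:-}" /usr/local/slurm/lib)
export PATH LD_LIBRARY_PATH

# Report whether the Slurm client commands are now reachable.
if command -v sinfo >/dev/null 2>&1; then
    echo "sinfo found at $(command -v sinfo)"
else
    echo "sinfo not found on PATH"
fi
```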

Basic Commands

# sorta like bqueues
$ sinfo -l
Thu Oct 14 09:27:02 2021
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE NODELIST
test*        up   infinite 1-infinite   no EXCLUSIV        all      3        idle n[37,78-79]
mwgpu        up   infinite 1-infinite   no    YES:4        all      1        idle n37
amber128     up   infinite 1-infinite   no    YES:4        all      1        idle n78
exx96        up   infinite 1-infinite   no    YES:4        all      1        idle n79

# more node info
$ sinfo -lN
Thu Oct 14 13:57:12 2021
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
n37            1     test*        idle 32      2:8:2 257917        0      1 hasLocal none
n37            1     mwgpu        idle 32      2:8:2 257917        0      1 hasLocal none
n78            1  amber128        idle 32      2:8:2 128660        0      1 hasLocal none
n78            1     test*        idle 32      2:8:2 128660        0      1 hasLocal none
n79            1     test*        idle 48     2:12:2  95056        0      1 hasLocal none
n79            1     exx96        idle 48     2:12:2  95056        0      1 hasLocal none


# sorta like bsub
$ sbatch run
Submitted batch job 1000002

# sorta like bjobs
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1000002      test     test    hmeij  R       0:08      1 n78

# sorta like bhosts -l
$ scontrol show node n78
NodeName=n78 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=0 CPUTot=32 CPULoad=0.03
   AvailableFeatures=hasLocalscratch
   ActiveFeatures=hasLocalscratch
   Gres=gpu:geforce_gtx_1080_ti:4(S:0-1)
   NodeAddr=n78 NodeHostName=n78 Version=21.08.1
   OS=Linux 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017
   RealMemory=128660 AllocMem=0 FreeMem=72987 Sockets=2 Boards=1
   MemSpecLimit=1024
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=test,amber128
   BootTime=2021-03-28T20:35:53 SlurmdStartTime=2021-10-14T13:56:00
   LastBusyTime=2021-10-14T13:56:01
   CfgTRES=cpu=32,mem=128660M,billing=32
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


# sorta like bhist -l
$ scontrol show job 1000002
JobId=1000002 JobName=test
   UserId=hmeij(8216) GroupId=its(623) MCS_label=N/A
   Priority=4294901757 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:03:18 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2021-10-11T13:27:58 EligibleTime=2021-10-11T13:27:58
   AccrueTime=2021-10-11T13:27:58
   StartTime=2021-10-11T13:27:58 EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-10-11T13:27:58 Scheduler=Main
   Partition=test AllocNode:Sid=greentail52:70776
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=n78
   BatchHost=n78
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:1:1
   TRES=cpu=2,mem=128M,node=1,billing=2
   Socks/Node=1 NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=128M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/zfshomes/hmeij/slurm/run
   WorkDir=/zfshomes/hmeij/slurm
   StdErr=/zfshomes/hmeij/slurm/err
   StdIn=/dev/null
   StdOut=/zfshomes/hmeij/slurm/out
   Power=
   TresPerNode=gres:gpu:1
   MailUser=hmeij@wesleyan.edu MailType=END

# sorta like bkill
$ scancel 1000003

Documentation

Overview

From the information above, it is a matter of learning new terminology and how to control devices. As you can see, sinfo shows that nodes can exist in multiple partitions (queues; the '*' denotes the default queue). So we could simply rebuild our queues in Slurm. But Slurm also presents node “features” (arbitrary resources like “hasLocalscratch”) and/or node “generic resources” (consumable, boolean resources like “gpu”). With a combination of those, or just a very specific request for a resource, you can control the routing of your job. For example, queue test contains 3 nodes, but requesting resource gpu:geforce_rtx_2080_s ensures you end up on node n79. Or you can simply request gpu:1 if the gpu model is not important.
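
As a rough illustration of routing by generic resource, the sketch below maps a short keyword to the --gres strings used in the sample submit script on this page. The keywords (k20, gtx1080, rtx2080) and the helper name gres_for are made up for this example; only the gres strings and node names come from this page.

```shell
# Map a short gpu keyword to a --gres request string.
# gres_for and its keywords are hypothetical (assumption); the gres
# strings are the ones listed in the sample submit script.
gres_for() {
    case "$1" in
        k20)     echo "gpu:tesla_k20m:1" ;;            # n37
        gtx1080) echo "gpu:geforce_gtx_1080_ti:1" ;;   # n78
        rtx2080) echo "gpu:geforce_rtx_2080_s:1" ;;    # n79
        any|"")  echo "gpu:1" ;;                       # any gpu model
        *)       echo "unknown gpu model: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) the submit command so it can be inspected first.
echo sbatch --partition=test --gres="$(gres_for rtx2080)" run
```

Dropping the leading echo would actually submit the job script named run.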

The same applies to the cpu only compute nodes. Features could be created for memory footprints (for example “hasMem32”, “hasMem64”, “hasMem128”, “hasMem192”, “hasMem256”). Then all the cpu only nodes can go into one queue and we can stick all cpu+gpu nodes in another queue. Or all of them in a single queue. We'll see, just testing.

On the cpu resource requests: you may request 1 or more nodes, 1 or more sockets per node, 1 or more cores (physical) per socket, or 1 or more threads (logical + physical) per core. Such a request can be fine grained or not; just request a node with --exclusive (test queue only) or share nodes (other queues, with --oversubscribe).
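
To make the sockets:cores:threads arithmetic concrete, here is a small sketch that turns the S:C:T column of sinfo -lN into logical cpu and physical core counts. The helper names are hypothetical; the S:C:T values are the ones shown in the sinfo output above.

```shell
# Logical cpus per node = sockets * cores/socket * threads/core.
# cpus_from_sct and cores_from_sct are hypothetical helpers (assumption).
cpus_from_sct() {
    sct=$1
    s=${sct%%:*}          # sockets
    rest=${sct#*:}
    c=${rest%%:*}         # cores per socket
    t=${rest#*:}          # threads per core
    echo $(( s * c * t ))
}

# Physical cores only (hyperthreads ignored).
cores_from_sct() {
    sct=$1
    s=${sct%%:*}
    rest=${sct#*:}
    c=${rest%%:*}
    echo $(( s * c ))
}

cpus_from_sct 2:8:2     # n37 and n78: prints 32
cpus_from_sct 2:12:2    # n79: prints 48
cores_from_sct 2:8:2    # n37 and n78: prints 16
```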

Note: this oversubscribing is not working yet. I can only get 4 simultaneous jobs running. Maybe there is a conflict with Openlava jobs. Should isolate a node and do further testing. After isolation (n37), 4 jobs with -n 4 exhausts number of physical cores. Is that why 5th job goes pending? Solved, see Changes section.

MPI

Slurm has a built-in MPI flavor, but I suggest you do not rely on it. The documentation states that on major release upgrades the libslurm.so library is not backwards compatible, and all software using it would need to be recompiled. There is also a handy parallel job launcher that may be of use, called srun.

For now, we'll rely on PATH/LD_LIBRARY_PATH settings to control the environment. This also implies your job should run under Openlava or Slurm. With the new head node deployment we'll introduce modules to control the environment for newly installed software.

srun commands can be embedded in a job submission script, but srun can also be run interactively. For example:

$ srun --partition=mwgpu -n 4 -B 1:4:1 --mem=1024 sleep 60 &

For more details on srun consult https://slurm.schedmd.com/srun.html

Script

Putting it all together, a job submission script might look like the example below. Simply submit it with sbatch, assuming the job script is named run:

$ sbatch run

Sample Submit Script

#!/bin/bash
# [found at XStream]
# Slurm will IGNORE all lines after the FIRST BLANK LINE,
# even the ones containing #SBATCH.
# Always put your SBATCH parameters at the top of your batch script.
# Took me days to find, [constraint=|gres=] were not working ... silly behavior -Henk
#
# GENERAL
#SBATCH --job-name="test"
#SBATCH --output=out   # or both in default file
#SBATCH --error=err    # slurm-$SLURM_JOBID.out
#SBATCH --mail-type=END
#SBATCH --mail-user=username@wesleyan.edu
#
# NODE control
#SBATCH -N 1     # default, nodes
###SBATCH --nodelist=n78,n79
###SBATCH --constraint=hasLocalscratch     # n37, n78
###SBATCH --constraint=hasLocalscratch1tb  # n79
###SBATCH --exclusive                      # test queue only
###SBATCH --oversubscribe                  # not on test queue
#
# CPU control
#SBATCH -n 8     # total cpus request is tasks=N(S*C*T)
#SBATCH -B 1:4:2 # S:C:T=sockets/node:cores/socket:threads/core
#
# GPU control
###SBATCH --gres=gpu:geforce_gtx_1080_ti:1 # n78
###SBATCH --gres=gpu:geforce_rtx_2080_s:1  # n79
###SBATCH --gres=gpu:tesla_k20m:1          # n37
#SBATCH --gres=gpu:1                     # any

# ENV control
# openmpi
export PATH=/share/apps/CENTOS7/openmpi/4.0.4/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS7/openmpi/4.0.4/lib:$LD_LIBRARY_PATH
which mpirun

# unique job scratch dir created(prolog)/cleaned(epilog)
export MYSANSCRATCH=/sanscratch/$SLURM_JOBID
export MYLOCALSCRATCH=/localscratch/$SLURM_JOBID
cd $MYLOCALSCRATCH
pwd

# CPU serial job example
date  # look in stdout file
datee # look in stderr file
env | grep ^SLURM
echo "hello world of slurm"
touch foo
ls -l foo

# CPU mpi example, note: no -np flag, no --hostfile
mpirun $HOME/slurm/hello_c

# GPU docker example, be sure to select rtx2080s gpu
# manual "wrapper" setup to find idle gpu, on localhost
# cuda 10.2
gpuid="` gpu-free | sed "s/,/\n/g" | shuf | head -1 ` "
echo ""; echo "docker running on gpu $HOSTNAME:$gpuid"; echo ""
#export CUDA_VISIBLE_DEVICES=$gpuid # or NV_GPU

NV_GPU=$gpuid \
nvidia-docker run --rm -u $(id -u):$(id -g) \
-v /$HOME:/mnt/$USER \
-v /home/apps:/mnt/apps \
-v /usr/local:/mnt/local \
nvcr.io/nvidia/tensorflow:19.09-py2 python \
/mnt/$USER/jobs/docker/benchmarks-master/scripts/tf_cnn_benchmarks/run_tests.py \
--num_gpus=1 --batch_size=64 \
--model=resnet50 \
--variable_update=parameter_server > $HOME/slurm/out.docker

sleep 5m  # so you can query job/node with scontrol
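
Since misplaced #SBATCH lines are silently ignored, a quick check before submitting can save some head scratching. This is only a sketch: check_sbatch_header is a hypothetical helper, and it applies the strict reading from the script comment above (header ends at the first blank or non-comment line). The sbatch documentation describes the cutoff as the first non-comment, non-whitespace line, so this check may flag scripts that sbatch would in fact accept.

```shell
# List "#SBATCH" lines that appear after the first blank or
# non-comment line of a job script; exit status 1 if any are found.
# check_sbatch_header is a hypothetical helper (assumption).
check_sbatch_header() {
    awk '
        BEGIN { bad = 0; body = 0 }
        NR == 1 && /^#!/ { next }                            # skip the shebang
        !body && (/^[[:space:]]*$/ || !/^#/) { body = NR }   # header ends here
        body && /^#SBATCH/ { printf "line %d: %s\n", NR, $0; bad = 1 }
        END { exit bad }
    ' "$1"
}
```

For example, check_sbatch_header run && sbatch run would only submit when no stray directives are reported. Lines commented out as ###SBATCH are not flagged.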

Script Output

The relevant sections of the script above should generate output like this

err file

# the stderr file starts with
/var/spool/slurmd/job1000056/slurm_script: line 47: datee: command not found

# lots of tensorflow warnings
<snip>

# and the app also writes its summary to stderr

----------------------------------------------------------------------
Ran 104 tests in 197.454s

OK (skipped=12)

out file

/share/apps/CENTOS7/openmpi/4.0.4/bin/mpirun
/localscratch/1000056
Thu Oct 14 10:36:22 EDT 2021
SLURM_NODELIST=n79
SLURM_JOB_NAME=test
SLURMD_NODENAME=n79
SLURM_TOPOLOGY_ADDR=n79
SLURM_THREADS_PER_CORE=2
SLURM_PRIO_PROCESS=0
SLURM_NODE_ALIASES=(null)
SLURM_GPUS_ON_NODE=4
SLURM_TOPOLOGY_ADDR_PATTERN=node
SLURM_JOB_GPUS=0,1,2,3
SLURM_NNODES=1
SLURM_JOBID=1000056
SLURM_NTASKS=8
SLURM_TASKS_PER_NODE=8
SLURM_WORKING_CLUSTER=slurmcluster:greentail52:6817:9472:109
SLURM_CONF=/usr/local/slurm-21.08.1/etc/slurm.conf
SLURM_JOB_ID=1000056
SLURM_JOB_USER=hmeij
SLURM_JOB_UID=8216
SLURM_NODEID=0
SLURM_SUBMIT_DIR=/zfshomes/hmeij/slurm
SLURM_TASK_PID=257975
SLURM_NPROCS=8
SLURM_CPUS_ON_NODE=48
SLURM_PROCID=0
SLURM_JOB_NODELIST=n79
SLURM_LOCALID=0
SLURM_JOB_GID=623
SLURM_JOB_CPUS_PER_NODE=48
SLURM_CLUSTER_NAME=slurmcluster
SLURM_GTIDS=0
SLURM_SUBMIT_HOST=greentail52
SLURM_JOB_PARTITION=test
SLURM_JOB_NUM_NODES=1
SLURM_MEM_PER_NODE=192
hello world of slurm
-rw-r--r-- 1 hmeij its 0 Oct 14 10:36 foo
Hello, world, I am 0 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ident: 4.0.4, repo rev: v4.0.4, Jun 10, 2020, 112)
Hello, world, I am 1 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ident: 4.0.4, repo rev: v4.0.4, Jun 10, 2020, 112)
Hello, world, I am 4 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ident: 4.0.4, repo rev: v4.0.4, Jun 10, 2020, 112)
Hello, world, I am 5 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ident: 4.0.4, repo rev: v4.0.4, Jun 10, 2020, 112)
Hello, world, I am 2 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ident: 4.0.4, repo rev: v4.0.4, Jun 10, 2020, 112)
Hello, world, I am 3 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ident: 4.0.4, repo rev: v4.0.4, Jun 10, 2020, 112)
Hello, world, I am 6 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ident: 4.0.4, repo rev: v4.0.4, Jun 10, 2020, 112)
Hello, world, I am 7 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ident: 4.0.4, repo rev: v4.0.4, Jun 10, 2020, 112)

docker running on gpu n79:3

and the out.docker file

================
== TensorFlow ==
================

NVIDIA Release 19.09 (build 8044706)
TensorFlow Version 1.14.0

Container image Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017-2019 The TensorFlow Authors.  All rights reserved.

<snip>

Generating training model
Initializing graph
Running warm up
Done warm up

<snip>

----------------------------------------------------------------
Generating training model
Initializing graph
Running warm up
Done warm up
Step    Img/sec total_loss
1       images/sec: 110.3 +/- 0.0 (jitter = 0.0)        1.156250119209290
2       images/sec: 213.2 +/- 1096.9 (jitter = 2299.9)  7.638743400573730
3       images/sec: 309.0 +/- 822.7 (jitter = 246.5)    -2.596951484680176
4       images/sec: 398.6 +/- 649.2 (jitter = 123.3)    -35.271511077880859
----------------------------------------------------------------
total images/sec: 378.12
----------------------------------------------------------------

Feedback

If there are errors or misstatements on this page, let me know. As we test and improve the setup to mimic a production environment, I will update the page (and mark those entries with a newer timestamp/signature).

Henk 2021/10/15 09:16

gpu testing

gpu testing 2

Newer 2022 version seems to have reversed the override options for oversubscribe. So here is our testing…back to CR_CPU_Memory and OverSubscribe=No — Henk 2022/11/02 13:23

CR_Socket_Memory
PartitionName=test Nodes=n[100-101] 
Default=YES MaxTime=INFINITE State=UP 
OverSubscribe=No DefCpuPerGPU=12

MPI jobs with -N 1, -n 8 and -B 2:4:1
no override options, cpus=48
--mem=2048, cpus=48
and --cpus-per-task=1, cpus=48
and  --ntasks-per-node=8, cpus=24

MPI jobs with -N 1, -n 8 and -B 1:8:1
--mem=10240 cpus=48
and --cpus-per-task=1, cpus=48
and  --ntasks-per-node=8, cpus=24

GPU jobs with -N 1, -n 1 and -B 1:1:1 
no override options, no cuda export, cpus=48
--cpus-per-gpu=1, cpus=24
and --mem-per-gpu=7168, cpus=1 (pending
while other gpu runs in queue but gpus are free???)

GPU jobs with -N 1, -n 1 and -B 1:1:1 
no override options, yes cuda export, cpus=48
--cpus-per-gpu=1, cpus=24
and --mem-per-gpu=7168, cpus=1 (resources pending
while a gpu job runs, gpus are free, then it executes)

...suddenly the cpus=1 turns into cpus=24
when submitting, slurm confused because of all
the job cancellations?

CR_CPU_Memory test=no, mwgpu=force:16
PartitionName=test Nodes=n[100-101] 
Default=YES MaxTime=INFINITE State=UP 
OverSubscribe=No DefCpuPerGPU=12

MPI jobs with -N 1, -n 8 and -B 2:4:1
no override options, cpus=8 (queue fills across nodes,
but only one job per node, test & mwgpu)
--mem=1024, cpus=8 (queue fills first node ...,
but only three jobs per node, test 3x8=24 full 4th job pending & 
mwgpu 17th job goes pending on n33, overloaded with -n 8 !!)
(not needed) --cpus-per-task=?, cpus=
(not needed)  --ntasks-per-node=?, cpus=


GPU jobs with -N 1, -n 1 and -B 1:1:1 on test
no override options, no cuda export, cpus=12 (one gpu per node)
--cpus-per-gpu=1, cpus=1 (one gpu per node)
and --mem-per-gpu=7168, cpus=1 (both override options
required else all mem allocated!, max 4 jobs per node,
fills first node first...cuda export not needed)
with cuda export, same node, same gpu,
with "no" enabled multiple jobs per gpu not accepted


GPU jobs with -N 1, -n 1 and -B 1:1:1 on mwgpu
--cpus-per-gpu=1,
and --mem-per-gpu=7168, cpus=1 
(same node, same gpu, cuda export set, 
with "force:16" enabled 4 jobs per gpu accepted,
potential for overloading!)

Changes

OverSubscribe

A suggestion was made to set OverSubscribe=No for all partitions (thanks, Colin). With a simple sleep script we now observe that we can run 16 jobs simultaneously (with either -n or -B). So that's 16 physical cores, each with a logical core (thread), for a total of 32 cpus on n37.

for i in `seq 1 17`;do sbatch sleep; done

#!/bin/bash
#SBATCH --job-name=sleep
#SBATCH --partition=mwgpu
###SBATCH -n 1
#SBATCH -B 1:1:1
#SBATCH --mem=1024
sleep 60

Henk 2021/10/15 15:18
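
To see how many of the submitted sleep jobs run at once, the job states can be tallied. The sketch below assumes squeue is asked for states only (squeue -h -o %T); count_states is a hypothetical helper, and the sample input here is canned rather than real squeue output.

```shell
# Summarize job states, one state per line on stdin.
# count_states is a hypothetical helper (assumption).
count_states() {
    sort | uniq -c | awk '{ printf "%s=%d\n", $2, $1 }'
}

# On the cluster one would pipe squeue into it, for example:
#   squeue -h -u "$USER" -o %T | count_states
# Canned sample standing in for squeue: 16 running, 1 pending.
{ for i in $(seq 1 16); do echo RUNNING; done; echo PENDING; } | count_states
```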

GPU-CPU cores

Noticed this with debug level on in slurmd.log. No action taken.

 
# n37: old gpu model bound to all physical cpu cores
GRES[gpu] Type:tesla_k20m Count:1 Cores(32):0-15  Links:-1,0,0,0 /dev/nvidia0
GRES[gpu] Type:tesla_k20m Count:1 Cores(32):0-15  Links:0,-1,0,0 /dev/nvidia1
GRES[gpu] Type:tesla_k20m Count:1 Cores(32):0-15  Links:0,0,-1,0 /dev/nvidia2
GRES[gpu] Type:tesla_k20m Count:1 Cores(32):0-15  Links:0,0,0,-1 /dev/nvidia3

# n78: somewhat dated gpu model, bound to top/bot of physical cores (16)
GRES[gpu] Type:geforce_gtx_1080_ti Count:1 Cores(32):0-7   Links:-1,0,0,0 /dev/nvidia0
GRES[gpu] Type:geforce_gtx_1080_ti Count:1 Cores(32):0-7   Links:0,-1,0,0 /dev/nvidia1
GRES[gpu] Type:geforce_gtx_1080_ti Count:1 Cores(32):8-15  Links:0,0,-1,0 /dev/nvidia2
GRES[gpu] Type:geforce_gtx_1080_ti Count:1 Cores(32):8-15  Links:0,0,0,-1 /dev/nvidia3

# n79, more recent gpu model, same bound pattern of top/bot (24)
GRES[gpu] Type:geforce_rtx_2080_s Count:1 Cores(48):0-11  Links:-1,0,0,0 /dev/nvidia0
GRES[gpu] Type:geforce_rtx_2080_s Count:1 Cores(48):0-11  Links:0,-1,0,0 /dev/nvidia1
GRES[gpu] Type:geforce_rtx_2080_s Count:1 Cores(48):12-23  Links:0,0,-1,0 /dev/nvidia2
GRES[gpu] Type:geforce_rtx_2080_s Count:1 Cores(48):12-23  Links:0,0,0,-1 /dev/nvidia3
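
To pull the device-to-core binding out of a longer slurmd.log, a filter along these lines could help. It is a sketch that assumes the log lines have the same field layout as the excerpts above (possibly with a timestamp prefix); gres_core_map is a hypothetical helper name.

```shell
# Print "device type cores" for each GRES[gpu] line read on stdin.
# gres_core_map is a hypothetical helper (assumption).
gres_core_map() {
    awk '/GRES\[gpu\]/ {
        type = ""; cores = ""; dev = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^Type:/)   type = substr($i, 6)                  # text after "Type:"
            if ($i ~ /^Cores\(/) { split($i, a, ":"); cores = a[2] }   # core range
            if ($i ~ /^\/dev\//) dev = $i                              # device file
        }
        print dev, type, cores
    }'
}
```

Usage would be something like gres_core_map < slurmd.log, with the log file location depending on the local setup.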

Partition Priority

If set you can list more than one queue…

 srun --partition=exx96,amber128,mwgpu  --mem=1024  --gpus=1  --gres=gpu:any sleep 60 &

The above will fill up n79 first, then n78, then n37…

Node Weight Priority

Weight nodes by the memory per logical core: jobs will be allocated the nodes with the lowest weight that satisfy their requirements. So CPU jobs will be routed to the gpu queues last, because those have the highest weight (=lowest priority).

hp12: 12/8 = 1.5
tinymem: 32/20 = 1.6
mw128: 128/24 = 5.333333
mw256: 256/16 = 16

exx96: 96/24 = 4
amber128: 128/16 = 8
mwgpu: 256/16 = 16
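
The ratios above can be recomputed and sorted with a few lines of shell. This sketch assumes the first number is memory in GB and the second a core count, as in the list; mem_per_core is a hypothetical helper. Slurm's Weight setting appears to expect an integer, so ratios like these would presumably need scaling or rounding before going into slurm.conf.

```shell
# Memory per core, rounded to two decimals, as a candidate node weight.
# mem_per_core is a hypothetical helper (assumption).
mem_per_core() {
    awk -v mem="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", mem / cores }'
}

# Print queues from lowest weight (allocated first) to highest.
while read -r queue mem cores; do
    printf '%s %s\n' "$(mem_per_core "$mem" "$cores")" "$queue"
done <<'EOF' | sort -n
hp12 12 8
tinymem 32 20
mw128 128 24
mw256 256 16
exx96 96 24
amber128 128 16
mwgpu 256 16
EOF
```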

Or more arbitrary (based on desired cpu node consumption by cpu jobs). No action taken.

tinymem   10
mw128     20
mw256fd   30  + HasMem256 feature so cpu jobs can directly target large mem
mwgpu     40  + HasMem256 feature
amber128  50
exx96     80

CR_CPU_Memory

This makes for a better 1-1 relationship of physical core to ntask. The “hyperthreads” are still available to user jobs, but physical cores are consumed first, if I got all this right.

Deployed. May need to set threads=1 and cpus=(quantity of physical cores)… This went horribly wrong: it resulted in a sockets=1 and threads=1 setting for each node.

Henk 2021/10/18 14:32

We did set the number of cpus per gpu (12 for n79) and minimum memory settings. Now a 5th job goes pending once 48 cpus are consumed. When using sbatch, set -n 8, because sbatch will override the defaults.

 srun --partition=test  --mem=1024  --gres=gpu:geforce_rtx_2080_s:1 sleep 60 &
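
The 5th job going pending is consistent with simple arithmetic: 4 jobs at 12 cpus per gpu use all 48 cpus on n79 (and all 4 gpus). A rough sketch of that bound follows; max_gpu_jobs is a hypothetical helper, it ignores memory limits, and the 24 cpus per gpu case is made up for contrast.

```shell
# Rough upper bound on simultaneous single-gpu jobs on one node:
# the smaller of the gpu count and cpus / cpus-per-gpu.
# max_gpu_jobs is a hypothetical helper (assumption); memory is ignored.
max_gpu_jobs() {
    cpus=$1
    gpus=$2
    cpus_per_gpu=$3
    by_cpu=$(( cpus / cpus_per_gpu ))
    if [ "$by_cpu" -lt "$gpus" ]; then echo "$by_cpu"; else echo "$gpus"; fi
}

max_gpu_jobs 48 4 12   # n79 with 12 cpus per gpu: prints 4
max_gpu_jobs 48 4 24   # hypothetical 24 cpus per gpu: prints 2
```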

