--- cluster:208 [2021/10/14 19:36]
hmeij07 created
+++ cluster:208 [2021/10/15 13:02]
hmeij07 [Overview]
@@ Line 4: / Line 4: @@
 ===== Slurm Test Env =====
+There is a techie page at this location **[[cluster:207|Slurm Techie Page]]** for those of you who are interested in the setup.
+__This page is intended for users__ to get started with the Slurm scheduler. ''greentail52'' will be the Slurm scheduler test "controller" with several cpu+gpu compute nodes configured. Any jobs submitted should be simple, quick-running jobs, like "sleep" or "hello world" jobs. The configured compute nodes are still managed by Openlava.
+** Default Environment **
+Slurm was compiled within this environment
+<code>
+# installer found /usr/local/cuda symbolic link to n37-cuda-9.2
+export CUDAHOME=/usr/local/n37-cuda-9.2
+export PATH=/usr/local/n37-cuda-9.2/bin:$PATH
+export LD_LIBRARY_PATH=/usr/local/n37-cuda-9.2/lib64:$LD_LIBRARY_PATH
+which nvcc
+# openmpi, just in case
+export PATH=/share/apps/CENTOS7/openmpi/4.0.4/bin:$PATH
+export LD_LIBRARY_PATH=/share/apps/CENTOS7/openmpi/4.0.4/lib:$LD_LIBRARY_PATH
+which mpirun
+</code>
+** Slurm Location **
+<code>
+# for now, symbolic link to slurm-21.08.1
+export PATH=/usr/local/slurm/bin:$PATH
+export LD_LIBRARY_PATH=/usr/local/slurm/lib:$LD_LIBRARY_PATH
+</code>
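If these exports get sourced more than once (say from a login file and again from a job script), PATH keeps growing. A minimal sketch of an idempotent prepend follows. The helper name ''path_prepend'' is hypothetical and not part of the cluster setup; only the two Slurm directories come from the block above.

```shell
# Hypothetical helper: prepend directory $2 to the colon-separated
# variable named $1, but only if it is not already listed there.
path_prepend() {
  eval "_cur=\${$1:-}"                        # current value, empty if unset
  case ":$_cur:" in
    *":$2:"*) ;;                              # already present, do nothing
    *) eval "$1=\"\$2\${_cur:+:\$_cur}\"" ;;  # prepend, no stray colon
  esac
  export "$1"
}

path_prepend PATH /usr/local/slurm/bin
path_prepend LD_LIBRARY_PATH /usr/local/slurm/lib
```

Calling it a second time with the same directory leaves the variable unchanged.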
+===== Basic Commands =====
+<code>
+# sorta like bqueues
+$ sinfo -l
+Thu Oct 14 09:27:02 2021
+PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE NODELIST
+test*        up   infinite 1-infinite   no EXCLUSIV        all      3        idle n[37,78-79]
+mwgpu        up   infinite 1-infinite   no    YES:4        all      1        idle n37
+amber128     up   infinite 1-infinite   no    YES:4        all      1        idle n78
+exx96        up   infinite 1-infinite   no    YES:4        all      1        idle n79
+# more node info
+$ sinfo -lN
+Thu Oct 14 13:57:12 2021
+NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
+n37            1     test*        idle 32      2:8:2 257917        0      1 hasLocal none
+n37            1     mwgpu        idle 32      2:8:2 257917        0      1 hasLocal none
+n78            1  amber128        idle 32      2:8:2 128660        0      1 hasLocal none
+n78            1     test*        idle 32      2:8:2 128660        0      1 hasLocal none
+n79            1     test*        idle 48     2:12:2  95056        0      1 hasLocal none
+n79            1     exx96        idle 48     2:12:2  95056        0      1 hasLocal none
+# sorta like bsub
+$ sbatch run
+Submitted batch job 1000002
+# sorta like bjobs
+$ squeue
+             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
+           1000002      test     test    hmeij  R       0:08      1 n78
+# sorta like bhosts -l
+$ scontrol show node n78
+NodeName=n78 Arch=x86_64 CoresPerSocket=8
+   CPUAlloc=0 CPUTot=32 CPULoad=0.03
+   AvailableFeatures=hasLocalscratch
+   ActiveFeatures=hasLocalscratch
+   Gres=gpu:geforce_gtx_1080_ti:4(S:0-1)
+   NodeAddr=n78 NodeHostName=n78 Version=21.08.1
+   OS=Linux 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017
+   RealMemory=128660 AllocMem=0 FreeMem=72987 Sockets=2 Boards=1
+   MemSpecLimit=1024
+   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
+   Partitions=test,amber128
+   BootTime=2021-03-28T20:35:53 SlurmdStartTime=2021-10-14T13:56:00
+   LastBusyTime=2021-10-14T13:56:01
+   CfgTRES=cpu=32,mem=128660M,billing=32
+   AllocTRES=
+   CapWatts=n/a
+   CurrentWatts=0 AveWatts=0
+   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
+# sorta like bhist -l
+$ scontrol show job 1000002
+JobId=1000002 JobName=test
+   UserId=hmeij(8216) GroupId=its(623) MCS_label=N/A
+   Priority=4294901757 Nice=0 Account=(null) QOS=(null)
+   JobState=RUNNING Reason=None Dependency=(null)
+   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
+   RunTime=00:03:18 TimeLimit=UNLIMITED TimeMin=N/A
+   SubmitTime=2021-10-11T13:27:58 EligibleTime=2021-10-11T13:27:58
+   AccrueTime=2021-10-11T13:27:58
+   StartTime=2021-10-11T13:27:58 EndTime=Unknown Deadline=N/A
+   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-10-11T13:27:58 Scheduler=Main
+   Partition=test AllocNode:Sid=greentail52:70776
+   ReqNodeList=(null) ExcNodeList=(null)
+   NodeList=n78
+   BatchHost=n78
+   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:1:1
+   TRES=cpu=2,mem=128M,node=1,billing=2
+   Socks/Node=1 NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
+   MinCPUsNode=1 MinMemoryNode=128M MinTmpDiskNode=0
+   Features=(null) DelayBoot=00:00:00
+   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
+   Command=/zfshomes/hmeij/slurm/run
+   WorkDir=/zfshomes/hmeij/slurm
+   StdErr=/zfshomes/hmeij/slurm/err
+   StdIn=/dev/null
+   StdOut=/zfshomes/hmeij/slurm/out
+   Power=
+   TresPerNode=gres:gpu:1
+   MailUser=hmeij@wesleyan.edu MailType=END
+# sorta like bkill
+$ scancel 1000003
+</code>
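For those with Openlava muscle memory, the "sorta like" comments above can be collected into a small lookup. This is only a sketch: the function name ''ol2slurm'' is made up for illustration, and the mapping is rough and not exhaustive.

```shell
# Print the Slurm command that roughly corresponds to an Openlava one.
# The pairs mirror the "sorta like" comments in the listing above.
ol2slurm() {
  case "$1" in
    bqueues)     echo "sinfo -l" ;;
    bsub)        echo "sbatch" ;;
    bjobs)       echo "squeue" ;;
    "bhosts -l") echo "scontrol show node" ;;
    "bhist -l")  echo "scontrol show job" ;;
    bkill)       echo "scancel" ;;
    *)           echo "no mapping for: $1" >&2; return 1 ;;
  esac
}

ol2slurm bjobs   # prints: squeue
```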
+===== Documentation =====
+  * manual pages for conf files or commands, for example
+    * ''man slurm.conf''
+    * ''man sbatch''
+  * developers documentation web site
+    * https://slurm.schedmd.com/overview.html
+    * https://slurm.schedmd.com/documentation.html
+    * https://slurm.schedmd.com/quickstart.html
+    * https://slurm.schedmd.com/pdfs/summary.pdf (2 page overview)
+    * etc...
+===== Overview =====
+From the information above it is a matter of learning new terminology and how to control devices. As you can see, ''sinfo'' shows that nodes can exist in multiple partitions (queues; the '*' denotes the default queue). So we could simply rebuild our queues in Slurm. But Slurm also presents node "features" (arbitrary resources like "hasLocalscratch") and/or node "generic resources" (consumable, boolean resources, like "gpu"). With a combination of those, or just a very specific request for a resource, you can control the routing of your job. For example, queue ''test'' contains 3 nodes, but requesting resource ''gpu:geforce_rtx_2080_s'' ensures you end up on node **n79**. Or you can simply request ''gpu:1'' if the gpu model is not important.
+The same applies to the cpu only compute nodes. Features could be created for memory footprints (for example "hasMem32", "hasMem64", "hasMem128", "hasMem192", "hasMem256"). Then all the cpu only nodes can go into one queue and we can stick all cpu+gpu nodes in another queue. Or all of them in a single queue. We'll see, just testing.
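To make the routing idea concrete, here is a small sketch that assembles the ''sbatch'' options for a partition plus an optional feature and an optional generic resource. The helper name ''route_opts'' is hypothetical; ''--partition'', ''--constraint'' and ''--gres'' are regular sbatch options, and the feature and gpu names are the ones configured on this test setup.

```shell
# Hypothetical helper: build sbatch routing options.
#   $1 = partition, $2 = feature (may be empty), $3 = gres (may be empty)
route_opts() {
  opts="--partition=$1"
  if [ -n "$2" ]; then opts="$opts --constraint=$2"; fi
  if [ -n "$3" ]; then opts="$opts --gres=$3"; fi
  echo "$opts"
}

route_opts test "" gpu:geforce_rtx_2080_s:1   # specific gpu model
route_opts test hasLocalscratch gpu:1         # any gpu, with the feature
```

The result could then be used as ''sbatch $(route_opts test "" gpu:1) run''.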
+On the cpu resource requests: you may request 1 or more nodes, 1 or more sockets per node, 1 or more cores (physical) per socket, or 1 or more threads (logical + physical) per core. Such a request can be fine-grained or not; just request a whole node with ''--exclusive'' (test queue only) or share nodes with ''--oversubscribe'' (other queues).
+//Note: this oversubscribing is not working yet. I can only get 4 simultaneous jobs running. Maybe there is a conflict with Openlava jobs. Should isolate a node and do further testing. After isolation (n37), 4 jobs with -n 4 exhaust the number of physical cores. Is that why the 5th job goes pending?//
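The arithmetic behind S:C:T is simple enough to script. A quick sketch with two made-up helper names, ''sct_cpus'' and ''sct_cores''. Per the ''sinfo -lN'' listing, n37 is 2:8:2, which gives 32 logical cpus but only 16 physical cores. Four jobs with ''-n 4'' ask for 16 tasks, which matches the physical core count; that is consistent with the observation in the note but does not prove the cause.

```shell
# Logical cpus implied by an S:C:T triple (sockets:cores:threads).
sct_cpus() {
  s=${1%%:*}; rest=${1#*:}
  c=${rest%%:*}; t=${rest#*:}
  echo $(( $s * $c * $t ))
}

# Physical cores only (threads per core ignored).
sct_cores() {
  s=${1%%:*}; rest=${1#*:}
  c=${rest%%:*}
  echo $(( $s * $c ))
}

sct_cpus 1:4:2    # prints 8, a -B 1:4:2 request
sct_cpus 2:8:2    # prints 32, n37 logical cpus
sct_cores 2:8:2   # prints 16, n37 physical cores
```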
+===== MPI =====
+Slurm has a builtin MPI flavor, but I suggest you do not rely on it. The documentation states that on major release upgrades the ''libslurm.so'' library is not backwards compatible and all software using it would need to be recompiled. There is a handy parallel job launcher which may be of use; it is called ''srun''.
+For now, we'll rely on PATH/LD_LIBRARY_PATH settings to control the environment. This also implies your job should run under Openlava or Slurm. With the new head node deployment we'll introduce ''modules'' to control the environment.
+''srun'' commands can be embedded in a job submission script, but ''srun'' can also be run interactively. For example:
+<code>
+$ srun --partition=mwgpu -n 4 -B 1:1:1 --mem=1024 sleep 60 &
+</code>
+For more details on srun consult https://slurm.schedmd.com/srun.html
+===== Script =====
+Putting it all together, a job submission script might look like the example below. Simply submit it with ''sbatch'', assuming the job script name is ''run''.
+<code>
+$ sbatch run
+</code>
+** Sample Submit Script  **
+<code>
+#!/bin/bash
+# [found at XStream]
+# Slurm will IGNORE all lines after the FIRST BLANK LINE,
+# even the ones containing #SBATCH.
+# Always put your SBATCH parameters at the top of your batch script.
+# Took me days to find, [constraint=|gres=] were not working ... silly behavior -Henk
+#
+# GENERAL
+#SBATCH --job-name="test"
+#SBATCH --output=out   # or both in default file
+#SBATCH --error=err    # slurm-$SLURM_JOBID.out
+#SBATCH --mail-type=END
+#SBATCH --mail-user=username@wesleyan.edu
+#
+# NODE control
+#SBATCH -N 1     # default, nodes
+###SBATCH --nodelist=n78,n79
+###SBATCH --constraint=hasLocalscratch     # n37, n78
+###SBATCH --constraint=hasLocalscratch1tb  # n79
+###SBATCH --exclusive                      # test queue only
+###SBATCH --oversubscribe                  # not on test queue
+#
+# CPU control
+#SBATCH -n 8     # total cpus request is tasks=N(S*C*T)
+#SBATCH -B 1:4:2 # S:C:T=sockets/node:cores/socket:threads/core
+#
+# GPU control
+###SBATCH --gres=gpu:geforce_gtx_1080_ti:1 # n78
+###SBATCH --gres=gpu:geforce_rtx_2080_s:1  # n79
+###SBATCH --gres=gpu:tesla_k20m:1          # n37
+#SBATCH --gres=gpu:1                     # any
+# ENV control
+# openmpi
+export PATH=/share/apps/CENTOS7/openmpi/4.0.4/bin:$PATH
+export LD_LIBRARY_PATH=/share/apps/CENTOS7/openmpi/4.0.4/lib:$LD_LIBRARY_PATH
+which mpirun
+# unique job scratch dir created(prolog)/cleaned(epilog)
+export MYSANSCRATCH=/sanscratch/$SLURM_JOBID
+export MYLOCALSCRATCH=/localscratch/$SLURM_JOBID
+cd $MYLOCALSCRATCH
+pwd
+# CPU serial job example
+date  # look in stdout file
+datee # look in stderr file
+env | grep ^SLURM
+echo "hello world of slurm"
+touch foo
+ls -l foo
+# CPU mpi example, note: no -np flag, no --hostfile
+mpirun $HOME/slurm/hello_c
+# GPU docker example, be sure to select rtx2080s gpu
+# manual "wrapper" setup to find idle gpu, on localhost
+# cuda 10.2
+gpuid="$(gpu-free | sed 's/,/\n/g' | shuf | head -1)"
+echo ""; echo "docker running on gpu $HOSTNAME:$gpuid"; echo ""
+#export CUDA_VISIBLE_DEVICES=$gpuid # or NV_GPU
+NV_GPU=$gpuid \
+nvidia-docker run --rm -u $(id -u):$(id -g) \
+-v /$HOME:/mnt/$USER \
+-v /home/apps:/mnt/apps \
+-v /usr/local:/mnt/local \
+nvcr.io/nvidia/tensorflow:19.09-py2 python \
+/mnt/$USER/jobs/docker/benchmarks-master/scripts/tf_cnn_benchmarks/run_tests.py \
+--num_gpus=1 --batch_size=64 \
+--model=resnet50 \
+--variable_update=parameter_server > $HOME/slurm/out.docker
+sleep 5m  # so you can query job/node with scontrol
+</code>
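Since misplaced ''#SBATCH'' lines fail silently, a quick check before submitting can save time. The sketch below flags any ''#SBATCH'' line that appears after the first blank line (the rule as stated in the script comments) or after the first command. The sbatch man page words the cutoff as the first non-comment, non-whitespace line, so checking both is the conservative choice. The function name ''check_sbatch_placement'' is made up; commented-out ''###SBATCH'' lines are ignored.

```shell
# List #SBATCH directives that follow a blank line or a command.
# Prints "lineno: text" for each offender; exit status 1 if any found.
check_sbatch_placement() {
  awk '
    BEGIN { bad = 0 }
    /^[[:space:]]*$/ { cut = 1; next }   # blank line seen
    !/^#/            { cut = 1; next }   # first real command seen
    cut && /^#SBATCH/ { printf "%d: %s\n", NR, $0; bad = 1 }
    END { exit bad }
  ' "$1"
}

# usage, assuming the job script is named run:
#   check_sbatch_placement run && sbatch run
```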
+===== Script Output =====
+The relevant sections of the script above should generate output like this
+** err ** file
+<code>
+# the stderr file starts with
+/var/spool/slurmd/job1000056/slurm_script: line 47: datee: command not found
+# lots of tensorflow warnings
+<snip>
+# and what the app writes to stderr
+----------------------------------------------------------------------
+Ran 104 tests in 197.454s
+OK (skipped=12)
+</code>
+** out ** file
+<code>
+/share/apps/CENTOS7/openmpi/4.0.4/bin/mpirun
+/localscratch/1000056
+Thu Oct 14 10:36:22 EDT 2021
+SLURM_NODELIST=n79
+SLURM_JOB_NAME=test
+SLURMD_NODENAME=n79
+SLURM_TOPOLOGY_ADDR=n79
+SLURM_THREADS_PER_CORE=2
+SLURM_PRIO_PROCESS=0
+SLURM_NODE_ALIASES=(null)
+SLURM_GPUS_ON_NODE=4
+SLURM_TOPOLOGY_ADDR_PATTERN=node
+SLURM_JOB_GPUS=0,1,2,3
+SLURM_NNODES=1
+SLURM_JOBID=1000056
+SLURM_NTASKS=8
+SLURM_TASKS_PER_NODE=8
+SLURM_WORKING_CLUSTER=slurmcluster:greentail52:6817:9472:109
+SLURM_CONF=/usr/local/slurm-21.08.1/etc/slurm.conf
+SLURM_JOB_ID=1000056
+SLURM_JOB_USER=hmeij
+SLURM_JOB_UID=8216
+SLURM_NODEID=0
+SLURM_SUBMIT_DIR=/zfshomes/hmeij/slurm
+SLURM_TASK_PID=257975
+SLURM_NPROCS=8
+SLURM_CPUS_ON_NODE=48
+SLURM_PROCID=0
+SLURM_JOB_NODELIST=n79
+SLURM_LOCALID=0
+SLURM_JOB_GID=623
+SLURM_JOB_CPUS_PER_NODE=48
+SLURM_CLUSTER_NAME=slurmcluster
+SLURM_GTIDS=0
+SLURM_SUBMIT_HOST=greentail52
+SLURM_JOB_PARTITION=test
+SLURM_JOB_NUM_NODES=1
+SLURM_MEM_PER_NODE=192
+hello world of slurm
+-rw-r--r-- 1 hmeij its 0 Oct 14 10:36 foo
+Hello, world, I am 0 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ident: 4.0.4, repo rev: v4.0.4, Jun 10, 2020, 112)
+Hello, world, I am 1 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ident: 4.0.4, repo rev: v4.0.4, Jun 10, 2020, 112)
+Hello, world, I am 4 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ident: 4.0.4, repo rev: v4.0.4, Jun 10, 2020, 112)
+Hello, world, I am 5 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ident: 4.0.4, repo rev: v4.0.4, Jun 10, 2020, 112)
+Hello, world, I am 2 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ident: 4.0.4, repo rev: v4.0.4, Jun 10, 2020, 112)
+Hello, world, I am 3 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ident: 4.0.4, repo rev: v4.0.4, Jun 10, 2020, 112)
+Hello, world, I am 6 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ident: 4.0.4, repo rev: v4.0.4, Jun 10, 2020, 112)
+Hello, world, I am 7 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ident: 4.0.4, repo rev: v4.0.4, Jun 10, 2020, 112)
+docker running on gpu n79:3
+</code>
+and the **out.docker** file
+<code>
+================
+== TensorFlow ==
+================
+NVIDIA Release 19.09 (build 8044706)
+TensorFlow Version 1.14.0
+Container image Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
+Copyright 2017-2019 The TensorFlow Authors.  All rights reserved.
+<snip>
+Generating training model
+Initializing graph
+Running warm up
+Done warm up
+<snip>
+----------------------------------------------------------------
+Generating training model
+Initializing graph
+Running warm up
+Done warm up
+Step    Img/sec total_loss
+       images/sec: 110.3 +/- 0.0 (jitter = 0.0)        1.156250119209290
+       images/sec: 213.2 +/- 1096.9 (jitter = 2299.9)  7.638743400573730
+       images/sec: 309.0 +/- 822.7 (jitter = 246.5)    -2.596951484680176
+       images/sec: 398.6 +/- 649.2 (jitter = 123.3)    -35.271511077880859
+----------------------------------------------------------------
+total images/sec: 378.12
+----------------------------------------------------------------
+</code>
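To pull just the summary number out of a log like this, something along these lines should do. It is a sketch that assumes the ''total images/sec:'' line format shown above; the function name ''total_images_per_sec'' is made up.

```shell
# Print the summary throughput value(s) from a tf_cnn_benchmarks log.
# Reads the named file(s), or stdin when no file is given.
total_images_per_sec() {
  awk '/^total images\/sec:/ { print $3 }' "$@"
}

# usage:
#   total_images_per_sec $HOME/slurm/out.docker
```

A log holding several benchmark runs yields one value per line.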
+===== Feedback =====
+If there are errors on this page, or misstatements, let me know. As we test and improve the setup to mimic a production environment I will update the page (and mark those entries with timestamp/signature).
+ --- //[[hmeij@wesleyan.edu|Henk]] 2021/10/14 15:20//
