
Slurm Test Env

Getting a head start on our new login node plus two CPU+GPU compute node project. The hardware has been purchased but delivery times are long. Meanwhile it makes sense to set up a standalone Slurm scheduler, do some testing, and keep it as a backup. Slurm will run on greentail52 with some compute nodes.

This page is just intended to keep the documentation sources handy. For the user-facing documentation, go to the Users page: Slurm Test Env

SLURM documentation

# main page
https://slurm.schedmd.com/

# Slurm Quick Start User Guide
https://slurm.schedmd.com/quickstart.html
https://slurm.schedmd.com/tutorials.html
  
# Slurm Quick Start Administrator Guide
https://slurm.schedmd.com/quickstart_admin.html

ldconfig -n /usr/lib64 to find libslurm.so
support for accounting will be built if the MySQL development library is present
in the sample config the host's name is "mcri" and the name "emcri" is used for private network communication
nodes can be in more than one partition
an extensive sample configuration file is provided in etc/slurm.conf.example
at least these:
# Sample /etc/slurm.conf for mcr.llnl.gov
scontrol examples...

https://slurm.schedmd.com/slurm.conf.html
section: node configuration

The node range expression can contain one pair of square brackets with a sequence of comma-separated numbers and/or ranges of numbers separated by a "-" (e.g. "linux[0-64,128]", or "lx[15,18,32-33]")
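A quick way to visualize what a range expression expands to is bash brace expansion (only an approximation; with Slurm installed, `scontrol show hostnames` does the authoritative expansion):

```shell
# brace expansion mimics Slurm's bracket ranges for a quick sanity check
echo n{110..111}            # n110 n111
echo lx{15,18} lx{32..33}   # lx15 lx18 lx32 lx33
```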

Features (e.g. hasGPU, hasRTX5000)
are intended to be used to filter nodes eligible to run jobs via the --constraint argument:
a comma-delimited list of arbitrary strings indicative of some characteristic
associated with the node, e.g.
Feature=hasLocalscratch2T,
with a second node list for 5T if in the same queue
GRES (countable, unlike the boolean features)
A comma-delimited list of generic resources specifications for a node.
The format is: "<name>[:<type>][:no_consume]:<number>[K|M|G]", e.g.
"Gres=gpu:tesla:1,cpu:haswell:2"
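Combining features and GRES in one node definition might look like the sketch below (node names, feature strings, and GPU counts are hypothetical, not our actual hardware):

```
# hypothetical slurm.conf node line
NodeName=n[110-111] Gres=gpu:rtx5000:2 Feature=hasGPU,hasRTX5000
# matching user-side requests:
#   sbatch --constraint=hasRTX5000 ...   filter by feature
#   sbatch --gres=gpu:rtx5000:1 ...      consume one GPU
```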
section: partition configuration
DisableRootJobs=YES
Nodes=n[110-111],n[79-90]

https://slurm.schedmd.com/gres.html#GPU_Management
setting up gres.conf

give GPU jobs priority using the Multifactor Priority plugin:
https://slurm.schedmd.com/priority_multifactor.html#tres
PriorityWeightTRES=GRES/gpu=1000
example here: https://slurm.schedmd.com/SLUG19/Priority_and_Fair_Trees.pdf
requires fairshare and thus the database
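A minimal sketch of the slurm.conf lines this would involve (weights and periods are placeholders; the fairshare weight only has an effect once slurmdbd/accounting is in place):

```
PriorityType=priority/multifactor
PriorityWeightTRES=GRES/gpu=1000
PriorityWeightFairshare=10000      # placeholder weight; requires the database
PriorityDecayHalfLife=7-0
PriorityUsageResetPeriod=MONTHLY
PriorityFavorSmall=YES
```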

https://slurm.schedmd.com/mc_support.html
multi-core, multi-thread
--sockets-per-node=S 	Number of sockets in a node to dedicate to a job (minimum)
--cores-per-socket=C 	Number of cores in a socket to dedicate to a job (minimum)
--threads-per-core=T 	Minimum number of threads in a core to dedicate to a job. 
-B S[:C[:T]] 	Combined shortcut option for --sockets-per-node, --cores-per-socket, --threads-per-core 
Total cpus requested = (Nodes) x (S x C x T)
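A worked example of that formula (the socket/core/thread counts mirror the 2x12x2 layout of the n[110-111] nodes configured further down; adjust to the real hardware):

```shell
# Total cpus requested = (Nodes) x (S x C x T)
nodes=2; S=2; C=12; T=2
echo $(( nodes * S * C * T ))   # 96
```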

StateSaveLocation (useful for upgrades or downgrades)
upgrade once a year, head node first then compute nodes
to install the new version of Slurm to a unique directory and use a 
symbolic link to point the directory in your PATH to the version of 
Slurm you would like to use
(how does OpenHPC handle this?)
MPI libraries with Slurm integration should be recompiled;
the libslurm.so version number increases with major releases
beware of the SlurmdTimeout and SlurmctldTimeout values while upgrading
(use scontrol to change them)
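The symlink scheme can be sketched as follows (the /tmp paths are purely illustrative; in practice the link is /usr/local/slurm as set up below):

```shell
# each release lives in its own directory; one stable symlink selects the
# active version, so an upgrade (or downgrade) is a single atomic repoint
mkdir -p /tmp/demo/slurm-21.08.1 /tmp/demo/slurm-22.05.0
ln -sfn /tmp/demo/slurm-21.08.1 /tmp/demo/slurm   # current version
ln -sfn /tmp/demo/slurm-22.05.0 /tmp/demo/slurm   # upgrade = repoint link
readlink /tmp/demo/slurm                          # /tmp/demo/slurm-22.05.0
```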

https://slurm.schedmd.com/configurator.html
full version of config tool, 
skip sockets (+cpu+physical+logical, core+mem, no backfill)
see below, first attempt

https://slurm.schedmd.com/configless_slurm.html
configless mode means more network traffic; stick to local files kept in sync cluster-wide

https://slurm.schedmd.com/priority_multifactor.html
fair share (requires database), decay, reset monthly, favor small jobs
PriorityType=priority/multifactor
OpenHPC does have a slurm-slurmdbd-ohpc rpm; it's just the database service daemon, skip it

https://slurm.schedmd.com/sched_config.html
The backfill scheduling plugin is loaded by default
SchedulerType=sched/backfill
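Should we move off the builtin scheduler later, the change is a single line (the commented parameters are common tuning knobs, not tuned values):

```
SchedulerType=sched/backfill
#SchedulerParameters=bf_window=1440,bf_continue
```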

https://slurm.schedmd.com/cons_res.html
the exclusive-use default allocation policy in Slurm can result in inefficient utilization 
SelectType=select/cons_tres (includes all con_res options, adds gpu options)
set SlurmctldLogFile and SlurmdLogFile locations (else syslog)

https://slurm.schedmd.com/accounting.html
sacct (text file), sreport (database), settings below for minimal overhead
JobCompType=jobcomp/filetxt and JobCompLoc=/var/log/slurm/job_completions
logrotate; send a SIGUSR2 signal to the slurmctld daemon after moving the files
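A logrotate stanza along these lines could handle it (paths match the log settings noted above; this is a sketch, not tested in production):

```
# hypothetical /etc/logrotate.d/slurm
/var/log/slurmctld.log /var/log/slurmd.log {
    weekly
    rotate 4
    compress
    missingok
    postrotate
        pkill -USR2 -x slurmctld || true
        pkill -USR2 -x slurmd || true
    endscript
}
```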

XSEDE Resources

What is XSEDE
https://portal.xsede.org/documentation-overview

Advanced Slurm
https://cvw.cac.cornell.edu/SLURM/default

MUNGE installation

download latest release https://dun.github.io/munge/
from https://github.com/dun/munge/releases/tag/munge-0.5.14

dun.gpg
munge-0.5.14.tar.xz
munge-0.5.14.tar.xz.asc

stage in tmp/ then build RPM file
https://github.com/dun/munge/wiki/Installation-Guide

rpmbuild -tb munge-0.5.14.tar.xz

# try on n78 first, as root

Wrote: /zfshomes/hmeij/rpmbuild/RPMS/x86_64/munge-0.5.14-1.el7.x86_64.rpm
Wrote: /zfshomes/hmeij/rpmbuild/RPMS/x86_64/munge-devel-0.5.14-1.el7.x86_64.rpm
Wrote: /zfshomes/hmeij/rpmbuild/RPMS/x86_64/munge-libs-0.5.14-1.el7.x86_64.rpm

# as root
cd /zfshomes/hmeij/rpmbuild/RPMS/x86_64/
rpm -ivh munge-0.5.14-1.el7.x86_64.rpm \
 munge-devel-0.5.14-1.el7.x86_64.rpm munge-libs-0.5.14-1.el7.x86_64.rpm

# create a key on greentail52, copy to n78 the test node
[root@greentail52 ~]# sudo -u munge /usr/sbin/mungekey --verbose
mungekey: Info: Created "/etc/munge/munge.key" with 1024-bit key
[root@greentail52 ~]# ls -l /etc/munge/munge.key
-rw------- 1 munge munge 128 Oct  5 08:28 /etc/munge/munge.key
[root@greentail52 ~]# scp -p /etc/munge/munge.key n78:/etc/munge/
munge.key                                     100%  128   223.8KB/s   00:00 

systemctl enable munge
systemctl start munge

 munge -n
 munge -n | unmunge
 munge -n -t 10 | ssh n78 unmunge

# remote decode working?
[root@greentail52 ~]# munge -n -t 10 | ssh n78 unmunge
STATUS:          Success (0)
ENCODE_HOST:     greentail52 (192.168.102.251)
ENCODE_TIME:     2021-10-05 09:27:45 -0400 (1633440465)
DECODE_TIME:     2021-10-05 09:27:44 -0400 (1633440464)
TTL:             10
CIPHER:          aes128 (4)
MAC:             sha256 (5)
ZIP:             none (0)
UID:             root (0)
GID:             root (0)
LENGTH:          0

# file locations
[root@greentail52 ~]# munged --help 
  -S, --socket=PATH        Specify local socket [/run/munge/munge.socket.2]
  --key-file=PATH          Specify key file [/etc/munge/munge.key]
  --log-file=PATH          Specify log file [/var/log/munge/munged.log]
  --pid-file=PATH          Specify PID file [/run/munge/munged.pid]
  --seed-file=PATH         Specify PRNG seed file [/var/lib/munge/munged.seed]

SLURM installation

Configured and compiled on greentail52 despite not having GPUs… only the NVIDIA Management Library (NVML) is needed

# cuda 9.2 ... 
# installer found /usr/local/cuda on ''greentail''

# just in case
export PATH=/share/apps/CENTOS7/openmpi/4.0.4/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS7/openmpi/4.0.4/lib:$LD_LIBRARY_PATH
which mpirun

# /usr/local/slurm is symbolic link to slurm-21.08.1
./configure \
--prefix=/usr/local/slurm-21.08.1 \
--sysconfdir=/usr/local/slurm-21.08.1/etc \
 | tee -a install.log
# skip # --with-nvml=/usr/local/n37-cuda-9.2 \
# skip # --with-hdf5=no  \
# known hdf5 library problem when including --with-nvml

grep -i nvml install.log
config.status: creating src/plugins/gpu/nvml/Makefile

====
Libraries have been installed in:
   /usr/local/slurm-21.08.1/lib/slurm

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the '-LLIBDIR'
flag during linking and do at least one of the following:
   - add LIBDIR to the 'LD_LIBRARY_PATH' environment variable
     during execution
   - add LIBDIR to the 'LD_RUN_PATH' environment variable
     during linking
   - use the '-Wl,-rpath -Wl,LIBDIR' linker flag
   - have your system administrator add LIBDIR to '/etc/ld.so.conf'
====

# for now
export PATH=/usr/local/slurm/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/slurm/lib:$LD_LIBRARY_PATH

For general accounting we may rely on a simple text file.

From the job completions file, JOB #3, convert the Start and End times to epoch seconds:

StartTime=2021-10-06T14:32:37 EndTime=2021-10-06T14:37:40

date --date='2021/10/06 14:32:37' +"%s"
1633545157

date --date='2021/10/06 14:37:40' +"%s"
1633545460

EndTime - StartTime = 1633545460-1633545157 = 303 seconds
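The same arithmetic as a snippet, so it can be scripted across the whole completions file:

```shell
# wall time in seconds from the Start/End timestamps of JOB #3
start=$(date --date='2021-10-06 14:32:37' +%s)
end=$(date --date='2021-10-06 14:37:40' +%s)
echo $(( end - start ))   # 303
```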

Full Version Slurm Config Tool

  • let's start with this file and build up/out
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=slurmcluster
SlurmctldHost=cottontail2
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
Epilog=/share/apps/lsf/slurm-epilog.sh
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/linuxproc
Prolog=/share/apps/lsf/slurm-prolog.sh
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
SrunEpilog=/share/apps/lsf/slurm-epilog.sh
SrunProlog=/share/apps/lsf/slurm-prolog.sh
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskEpilog=/share/apps/lsf/slurm-epilog.sh
TaskPlugin=task/affinity
TaskProlog=/share/apps/lsf/slurm-prolog.sh
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=300
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/builtin
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
#
#
# JOB PRIORITY
#PriorityFlags=
PriorityType=priority/basic
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=0
#PriorityCalcPeriod=
#PriorityFavorSmall=YES
#PriorityMaxAge=14-0
#PriorityUsageResetPeriod=MONTHLY
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
#AccountingStoreFlags=
#JobCompHost=
JobCompLoc=/var/log/slurmjobs.txt
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/filetxt
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=n[110-111] CPUs=2 RealMemory=192 CoresPerSocket=12 ThreadsPerCore=2 State=UNKNOWN
#
#
# PARTITIONS
PartitionName=test Nodes=n[110-111] Default=YES MaxTime=INFINITE State=UP
#
#
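Once slurmctld and slurmd are running with this config, a minimal batch script against the test partition would look like this sketch (script name and options are illustrative):

```shell
#!/bin/bash
#SBATCH --job-name=sanity
#SBATCH --partition=test
#SBATCH --nodes=1
#SBATCH --output=sanity-%j.out
# trivial payload: report where and when the job ran
hostname
date
```

Submit with sbatch sanity.sh, then watch placement with squeue and sinfo.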



cluster/207.txt · Last modified: 2022/08/02 08:08 by hmeij07