**[[cluster:0|Back]]**
  
**Make sure munge/unmunge work between 1.3/2.4, and that the date is in sync (else you get error #16)**
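A quick way to verify this (a minimal sketch; ''n100'' is only a placeholder for whichever of the two hosts you are testing against) is to encode a credential on one host, decode it on the other, and compare clocks:

<code>
# both hosts must share the same /etc/munge/munge.key and have munged running
munge -n | ssh n100 unmunge
# clock skew is the usual cause of the expired-credential error mentioned above
date; ssh n100 date
</code>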
  
===== Slurm Test Env =====
Getting a head start on our new login node plus two cpu+gpu compute node project. Hardware has been purchased but there is a long delivery time. Meanwhile it makes sense to set up a standalone Slurm scheduler, do some testing, and have it as a backup. Slurm will be running on ''greentail52'' with some compute nodes.
  
This page is just intended to keep documentation sources handy. Go to the **Users** page: [[cluster:208|Slurm Test Env]]
  
==== SLURM documentation ====
  
<code>
https://slurm.schedmd.com/slurm.conf.html
section: node configuration

The node range expression can contain one pair of square brackets with a sequence of comma-separated numbers and/or ranges of numbers separated by a "-" (e.g. "linux[0-64,128]", or "lx[15,18,32-33]")
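# sketch: expand a bracketed range on the command line to check it
#   scontrol show hostnames n[110-111]
# prints one hostname per line (n110, n111)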

Features (hasGPU, hasRTX5000)
are intended to be used to filter nodes eligible to run jobs via the --constraint argument.
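# sketch, using the feature names above: request a node carrying that feature
#   sbatch --constraint=hasRTX5000 --gres=gpu:1 job.sh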
https://slurm.schedmd.com/gres.html#GPU_Management
setting up gres.conf

give GPU jobs priority using the Multifactor Priority plugin:
https://slurm.schedmd.com/priority_multifactor.html#tres
PriorityWeightTRES=GRES/gpu=1000
example here: https://slurm.schedmd.com/SLUG19/Priority_and_Fair_Trees.pdf
requires fairshare, thus the database
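# minimal slurm.conf sketch for the above (values are illustrative only,
# and the slurmdbd/accounting setup itself is assumed to already exist):
#   AccountingStorageType=accounting_storage/slurmdbd
#   AccountingStorageTRES=gres/gpu
#   PriorityType=priority/multifactor
#   PriorityWeightTRES=GRES/gpu=1000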
  
https://slurm.schedmd.com/mc_support.html
</code>
  
  
==== MUNGE installation ====
  
<code>
</code>
  
==== SLURM installation Updated ====
  
<code>
  
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

[root@cottontail2 slurm-22.05.2]# which gcc mpicc nvcc
/opt/ohpc/pub/compiler/gcc/9.4.0/bin/gcc
/opt/ohpc/pub/mpi/openmpi4-gnu9/4.1.1/bin/mpicc
/usr/local/cuda/bin/nvcc
  
./configure \
--prefix=/usr/local/slurm-22.05.2 \
--sysconfdir=/usr/local/slurm-22.05.2/etc \
--with-nvml=/usr/local/cuda
make
make install

export PATH=/usr/local/slurm/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/slurm/lib:$LD_LIBRARY_PATH

[root@cottontail2 slurm-22.05.2]# find /usr/local/slurm-22.05.2/ -name auth_munge.so
/usr/local/slurm-22.05.2/lib/slurm/auth_munge.so

</code>
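A couple of quick post-install checks (a sketch, not from the original notes; the slurmd checks only make sense on a node that actually has the gpus to detect):

<code>
sinfo --version                      # confirm the expected 22.05.2 build is first in PATH
/usr/local/slurm/sbin/slurmd -C      # print the hardware this node would report for slurm.conf
/usr/local/slurm/sbin/slurmd -G      # print the GRES (gpu) devices autodetected via NVML
</code>
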
==== SLURM installation ====

Configured and compiled on ''greentail52'' despite it not having gpus ... only the management library (nvml) is needed.

<code>

# cuda 9.2 ...
# installer found /usr/local/cuda on ''greentail''

# just in case
export PATH=/share/apps/CENTOS7/openmpi/4.0.4/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS7/openmpi/4.0.4/lib:$LD_LIBRARY_PATH
which mpirun

# /usr/local/slurm is a symbolic link to slurm-21.08.1
./configure \
--prefix=/usr/local/slurm-21.08.1 \
--sysconfdir=/usr/local/slurm-21.08.1/etc \
 | tee -a install.log
# skip # --with-nvml=/usr/local/n37-cuda-9.2 \
# skip # --with-hdf5=no \
# known hdf5 library problem when including --with-nvml
  
config.status: creating src/plugins/gpu/nvml/Makefile
  
====
Libraries have been installed in:
   /usr/local/slurm-21.08.1/lib/slurm

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the '-LLIBDIR'
flag during linking and do at least one of the following:
   - add LIBDIR to the 'LD_LIBRARY_PATH' environment variable
     during execution
   - add LIBDIR to the 'LD_RUN_PATH' environment variable
     during linking
   - use the '-Wl,-rpath -Wl,LIBDIR' linker flag
   - have your system administrator add LIBDIR to '/etc/ld.so.conf'
====

# for now
export PATH=/usr/local/slurm/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/slurm/lib:$LD_LIBRARY_PATH
  
</code>


==== General Accounting ====

<code>

From job completions file, JOB #3, convert Start and End times to epoch seconds
  
StartTime=2021-10-06T14:32:37 EndTime=2021-10-06T14:37:40
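# one way to do the conversion (a sketch using GNU date, not from the original notes):
#   date -d "2021-10-06T14:32:37" +%s      -> start in epoch seconds
#   date -d "2021-10-06T14:37:40" +%s      -> end in epoch seconds
# elapsed = end - start = 303 seconds for this job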
</code>
  
  
==== Slurm Config Tool ====

  let's start with this file and build up/out
  
<code>
#
# COMPUTE NODES
NodeName=n[110-111] CPUs=2 RealMemory=192 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN
#
#
</code>
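Building out from here, a partition that groups the two test nodes might look like the line below (a sketch only; the partition name and limits are placeholders, not taken from the real config):

<code>
# hypothetical partition for the nodes defined above
PartitionName=test Nodes=n[110-111] Default=YES MaxTime=INFINITE State=UP
</code>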