  * August 2022 is designated **migration** period
  * Queues ''

==== Quick Start Slurm Guide ====

Jump to the **Rocky8/

There is also detailed information on Amber20/

  * [[cluster:
==== Basic Commands ====
# sorta like bhosts -l

# sorta like bstop/bresume
scontrol suspend job 1000001
scontrol resume job 1000001
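
# check why a job sits pending (a sketch, standard squeue options; the job id is an example)
squeue -j 1000001 --start          # estimated start time, if slurm can compute one
squeue -j 1000001 -o "%i %T %r"    # %r shows the pending reason, e.g. Resources or Priority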
# sorta like bhist -l
You must request **resources**,

Details:

https://

Some common examples are:
#SBATCH -n 8       # tasks=S*C*T
#SBATCH -B 2:4:1   # S:C:T = sockets:cores:threads
#SBATCH --mem=250
#SBATCH --ntasks-per-node=1   # perhaps needed to override oversubscribe
#SBATCH --cpus-per-task=1

GPU control
#SBATCH --cpus-per-gpu=1
#SBATCH --mem-per-gpu=7168
#SBATCH --gres=gpu:
#SBATCH --gres=gpu:
</code>
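
To put these options together, below is a minimal sketch of a complete CPU batch script. The job name, partition, memory, and time values are placeholders chosen for illustration, not site defaults; pick a real queue from ''sinfo -l'' and adjust the numbers to your job.

<code>
#!/bin/bash
#SBATCH --job-name=myjob           # placeholder name
#SBATCH --partition=mypartition    # placeholder, use a queue listed by "sinfo -l"
#SBATCH -N 1                       # nodes
#SBATCH -n 8                       # tasks
#SBATCH --mem=1024                 # in MB; always request memory (see Pending Jobs below)
#SBATCH --time=01:00:00            # walltime hh:mm:ss
#SBATCH --output=%x_%j.out         # jobname_jobid.out

echo "running on $(hostname) with $SLURM_NTASKS tasks"
srun hostname
</code>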

** Pending Jobs **

I keep having to inform users that with ''-n 1'' and ''--cpus-per-task=1'' your jobs can still end up pending, because the user forgot to reserve memory, so silly slurm assumes the job needs all of the node's memory. Here is my template reply:

<code>

FirstName, your jobs are pending because you did not request memory,
and when you do not, slurm assumes you need all of the node's memory, silly.
Command "scontrol show job 1062052" shows:

JobId=1062052 JobName=3a_avgHbond_CPU

I looked (command "ssh n?? top -u username -b -n 1", look for the VIRT value)
and you need less than 1G per job, so with --mem=1024 and n=1 and cpu=1
you should be able to load 48 jobs onto n100.
Consult the output of command "sinfo -lN"

</code>
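
In short, an explicit memory request along these lines (a sketch; the 1 GB figure is just the estimate from the note above, measure your own jobs with ''top'' first) lets many small serial jobs share a node instead of pending:

<code>
#SBATCH -n 1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1024    # in MB; omit this and slurm may reserve all of the node's memory
</code>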

==== MPI ====
#SBATCH --nodelist=n88

# may or may not be needed, centos7 login env
source $HOME/
which ifort   # should be the parallel studio 2016 version

# unique job scratch dirs
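# for example, something along these lines (a sketch; the scratch path below is an
# assumption for illustration, use whatever local filesystem your site provides):
#   export MYSCRATCH=/localscratch/$SLURM_JOB_ID
#   mkdir -p $MYSCRATCH && cd $MYSCRATCH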