  * August 2022 is designated the **migration** period
  * Queues ''

==== Quick Start Slurm Guide ====

Jump to the **Rocky8/

There is also detailed information on Amber20/

  * [[cluster:
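
A minimal sketch of a batch job, just to get going (file and job names below are placeholders, not taken from the template pages above): a Slurm job is a shell script with ''#SBATCH'' directives that you hand to ''sbatch''.

<code>
#!/bin/bash
#SBATCH --job-name=test
#SBATCH -n 1                  # one task
#SBATCH --cpus-per-task=1     # one cpu for that task
#SBATCH --mem=1024            # 1 GB, see the resources discussion below
#SBATCH --output=test_%j.out  # %j expands to the job id

date
hostname
</code>

Submit with ''sbatch test.sh'', watch it with ''squeue -u $USER'', cancel it with ''scancel jobid''.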
==== Basic Commands ====
# sorta like bhosts -l

# sorta like bstop/bresume
scontrol suspend 1000001
scontrol resume 1000001
# sorta like bhist -l
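# (a sketch, job id is a placeholder) sacct is one way to look at a finished job
sacct -j 1000001 --format=JobID,JobName,State,Elapsed,MaxRSS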
You must request **resources**,

Details

  * https://

Some common examples are:
#SBATCH -n 8 # tasks=S*C*T
#SBATCH -B 2:4:1 # S:C:T, sockets:cores:threads per node
#SBATCH --mem=250
#SBATCH --ntasks-per-node=1 # perhaps needed to override oversubscribe
#SBATCH --cpus-per-task=1

GPU control

#SBATCH --cpus-per-gpu=1
#SBATCH --mem-per-gpu=7168
#SBATCH --gres=gpu:
#SBATCH --gres=gpu:
</code>
** Pending Jobs **

I keep having to inform users that with ''-n 1'' and ''--cpus-per-task=1'' your job can still go into a pending state because you forgot to reserve memory ... silly Slurm then assumes your job needs all of the node's memory. Here is my template reply:

<code>

FirstName, your jobs are pending because you did not request memory,
and if you do not, then slurm assumes you need all the memory, silly.
Command "

JobId=1062052 JobName=3a_avgHbond_CPU

I looked (command "ssh n?? top -u username -b -n 1", look for the VIRT value)
and you need less than 1G per job so with --mem=1024 and n=1 and cpu=1
you should be able to load 48 jobs onto n100.
Consult output of command "sinfo -lN"

</code>

==== MPI ====
#SBATCH --nodelist=n88

# may or may not be needed, centos7 login env
source $HOME/
which ifort # should be the parallel studio 2016 version

# unique job scratch dirs
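# (a sketch, not part of the original script; the path below is a placeholder)
# one way to key the scratch dir on the Slurm job id so concurrent jobs do not collide
export MYSCRATCH=/localscratch/$SLURM_JOB_ID
mkdir -p $MYSCRATCH
cd $MYSCRATCH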