//cluster:218 (current revision 2025/06/19 15:43 by hmeij07)//
  * August 2022 is designated the **migration** period
  * Queues ''
+ | |||
+ | |||
+ | ==== Quick Start Slurm Guide ==== | ||
+ | |||
+ | Jump to the **Rocky8/ | ||
+ | |||
+ | There is also detailed information on Amber20/ | ||
+ | |||
+ | * [[cluster: | ||
==== Basic Commands ====
  # sorta like bhosts -l

  # sorta like bstop/bresume
  scontrol suspend 1000001
  scontrol resume 1000001

  # sorta like bhist -l
  * manual pages for conf files or commands, for example
    * ''
    * ''
  * etc ... (see above commands)
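Since the commands above are introduced by analogy to LSF, a rough cheat sheet of equivalents can be sketched in plain bash. This is the commonly cited mapping, not an exhaustive or authoritative one; some LSF commands map onto several Slurm tools.

```shell
#!/bin/bash
# Rough LSF-to-Slurm command equivalents (not exhaustive).
declare -A eq=(
  [bsub]="sbatch"
  [bjobs]="squeue"
  [bkill]="scancel"
  [bhosts]="sinfo"
  [bstop]="scontrol suspend"
  [bresume]="scontrol resume"
  [bhist]="sacct"
)
# Print the mapping as a small reference table.
for lsf in bsub bjobs bkill bhosts bstop bresume bhist; do
  printf '%-8s -> %s\n' "$lsf" "${eq[$lsf]}"
done
```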
You must request **resources**; if you do not, Slurm assumes defaults (for memory, the node's entire memory).

Details:

  * https://

Some common examples are:
<code>

Account control
#SBATCH --account=pifacultyusername

NODE control
#SBATCH -n 8                 # tasks=S*C*T
#SBATCH -B 2:4:1             # S:C:T
#SBATCH --mem=250
#SBATCH --ntasks-per-node=1  # perhaps needed to override oversubscribe
#SBATCH --cpus-per-task=1

GPU control
#SBATCH --cpus-per-gpu=1
#SBATCH --mem-per-gpu=7168
#SBATCH --gres=gpu:
#SBATCH --gres=gpu:

</code>
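Putting several of these directives together, a minimal batch script might look like the sketch below. The job name, output file, memory amount, and the trivial commands are illustrative placeholders, and ''pifacultyusername'' stands in for your PI's actual account name.

```shell
#!/bin/bash
# Minimal sketch of a batch script using the directives above.
# All values here are illustrative placeholders.
#SBATCH --job-name=demo
#SBATCH --output=out            # write stdout/stderr to ./out
#SBATCH --account=pifacultyusername
#SBATCH -n 1                    # one task
#SBATCH --cpus-per-task=1
#SBATCH --mem=1024              # MiB; always request memory explicitly

date
hostname
```

Submit with ''sbatch script.sh'' and watch it with ''squeue -u $USER''.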

** Pending Jobs **

I keep having to inform users that even with ''-n 1'' and one cpu per task a job can still go into pending state because the user forgot to reserve memory ... silly Slurm then assumes the job needs all of the node's memory. Here is my template reply:

<code>

FirstName, your jobs are pending because you did not request memory;
if you do not, slurm assumes you need all memory, silly.
Command "

JobId=1062052 JobName=3a_avgHbond_CPU

I looked (command "ssh n?? top -u username -b -n 1", look for the VIRT value)
and you need less than 1G per job so with --mem=1024 and n=1 and cpu=1
you should be able to load 48 jobs onto n100.
Consult output of command "sinfo -lN"

</code>

==== MPI ====

Slurm has a builtin MPI flavor. I suggest you do not rely on it: the documentation states that on major release upgrades the ''

There is a handy parallel job launcher which may be of use; it is called ''
<code>
[hmeij@cottontail2 ~]$ module avail

------------------- /
extrae/3.7.0    imb/2019.6      mfem/4.3
mumps/          ptscotch/6.0.6  trilinos/

------------------------- /

--------------------------- /

----------------------- /
</code>
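As a sketch of the module-based approach, a job script for an MPI binary could look like the following. The module name ''mvapich2/2.3.7'' and the binary ''./my_mpi_app'' are assumptions for illustration; substitute whatever ''module avail'' actually lists on the cluster.

```shell
#!/bin/bash
#SBATCH -n 8                 # 8 MPI ranks
#SBATCH --mem=8192           # request memory; see Pending Jobs above

# Load an MPI flavor from the module tree rather than Slurm's builtin MPI.
module load mvapich2/2.3.7   # assumption: pick a real name from "module avail"

mpirun -np 8 ./my_mpi_app    # placeholder binary
```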
  * ''/

<code>
#
# GENERAL
#SBATCH --account=pifacultyusername
#SBATCH --job-name="
#SBATCH --output=out

#SBATCH -B 1:1:1   # S:C:T
###SBATCH -B 2:4:1 # S:C:T
#
# GPU control
#SBATCH --cpus-per-gpu=1
#SBATCH --mem-per-gpu=7168
###SBATCH --gres=gpu:
#SBATCH --gres=gpu:

cd $MYLOCALSCRATCH

### AMBER20
#source /
# OR #
</code>
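The ''$MYLOCALSCRATCH'' usage above follows a common per-job scratch pattern: make a unique directory, work inside it, then remove it. The sketch below illustrates this; the ''/tmp/localscratch'' base path and the PID fallback are assumptions for illustration only.

```shell
#!/bin/bash
# Sketch: unique per-job scratch directory, created at start, removed at end.
job_id=${SLURM_JOB_ID:-$$}                # fall back to shell PID off-cluster
MYLOCALSCRATCH=/tmp/localscratch/$job_id  # base path is an assumption
mkdir -p "$MYLOCALSCRATCH"
cd "$MYLOCALSCRATCH"

# ... stage input data here, run the program, copy results back home ...

cd "$HOME"
rm -rf "$MYLOCALSCRATCH"                  # always clean up local scratch
```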
==== CentOS7 Slurm Template ====
In this job template I have it set up to run ''

Note also that we're running mwgpu'

  * ''/

<code>
#
# GENERAL
#SBATCH --account=pifacultyusername
#SBATCH --job-name="
#SBATCH --output=out

#
# GPU control
###SBATCH --cpus-per-gpu=1
###SBATCH --mem-per-gpu=7168
###SBATCH --gres=gpu:
###SBATCH --gres=gpu:
#
# Node control
#SBATCH --nodelist=n88

# may or may not be needed, centos7 login env
source $HOME/
which ifort   # should be the parallel studio 2016 version

# unique job scratch dirs

###source /
source /

# stage the data
cp -r ~/
</code>
July 2022 is for **testing...** lots to learn!

Kudos to Abhilash for working our way through all this.

\\
**[[cluster: