  * August 2022 is designated **migration** period
  * Queues ''

==== Quick Start Slurm Guide ====

Jump to the **Rocky8/

There is also detailed information on Amber20/

  * [[cluster:
==== Basic Commands ====

# sorta like bhosts -l

# sorta like bstop/bresume
scontrol suspend 1000001
scontrol resume 1000001
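
# a hedged sketch of two more common equivalents; these are standard
# slurm commands, the job id is just an example
# sorta like bjobs
squeue -u $USER
# sorta like bkill
scancel 1000001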
# sorta like bhist -l
You must request **resources**,

Details:

  * https://

Some common examples are:
#SBATCH -n 8 # tasks=S*C*T
#SBATCH -B 2:4:1 # S:C:T (sockets:cores:threads)
#SBATCH --mem=250 # memory, MB by default
#SBATCH --ntasks-per-node=1 # perhaps needed to override oversubscribe
#SBATCH --cpus-per-task=1

GPU control

#SBATCH --cpus-per-gpu=1
#SBATCH --mem-per-gpu=7168
#SBATCH --gres=gpu:
#SBATCH --gres=gpu:
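# general form is --gres=gpu[:type]:count; a hedged example asking for
# one gpu of any type (type names vary per cluster)
#SBATCH --gres=gpu:1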
</code>

** Pending Jobs **

I keep having to inform users that even with ''-n 1'' and ''--cpus-per-task=1'' your jobs can still go into the pending state if you forgot to reserve memory, because then silly Slurm assumes your job needs all of the node's memory. Here is my template reply:

<code>

FirstName, your jobs are pending because you did not request memory,
and if you do not, slurm assumes you need all of a node's memory, silly.
Command "scontrol show job 1062052" shows:
+ | |||
+ | JobId=1062052 JobName=3a_avgHbond_CPU | ||
+ | | ||
+ | | ||
+ | |||
I looked (command "ssh n?? top -u username -b -n 1", look for the VIRT value)
and you need less than 1G per job, so with --mem=1024 and n=1 and cpu=1
you should be able to load 48 jobs onto n100.
Consult the output of command "sinfo -lN" for each node's memory size.
+ | |||
+ | </ | ||
+ | |||
==== MPI ====

  * ''/

<code>
#SBATCH -B 1:1:1 # S:C:T (sockets:cores:threads)
###SBATCH -B 2:4:1 # S:C:T
#
# GPU control
#SBATCH --cpus-per-gpu=1
#SBATCH --mem-per-gpu=7168
###SBATCH --gres=gpu:
#SBATCH --gres=gpu:
cd $MYLOCALSCRATCH

### AMBER20
#source /
# OR #
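
# a hedged sketch of what typically follows: source one of the amber
# environments above, then launch pmemd.MPI with the requested task
# count (input file names are examples)
#mpirun -np $SLURM_NTASKS pmemd.MPI -O -i mdin -o mdout -p prmtop -c inpcrd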
==== CentOS7 Slurm Template ====

In this job template I have it set up to run ''
Note also that we're running mwgpu'
  * ''/

<code>
#
# GPU control
###SBATCH --cpus-per-gpu=1
###SBATCH --mem-per-gpu=7168
###SBATCH --gres=gpu:
###SBATCH --gres=gpu:
#
# Node control
#SBATCH --nodelist=n88
# may or may not be needed, centos7 login env
source $HOME/
which ifort # should be the parallel studio 2016 version
# unique job scratch dirs
###source /
source /
# stage the data
cp -r ~/
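
# a hedged sketch of the full stage-in / run / stage-out pattern this
# template follows (directory names are examples)
#cp -r ~/myrun/* $MYLOCALSCRATCH/
#( ... run the application ... )
#cp -r $MYLOCALSCRATCH/output ~/myrun/    # copy results back before scratch is wiped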
July 2022 is for **testing...** lots to learn!
Kudos to Abhilash for working our way through all this.
\\
**[[cluster: