===== Slurm Test Env =====

There is a techie page at this location: **[[cluster:...]]**

__This page is intended for users__ to get started with the Slurm scheduler.

** Default Environment **

Slurm was compiled within this environment:

<code>

# installer found /...
export CUDAHOME=/...
export PATH=/...
export LD_LIBRARY_PATH=/...
which nvcc

# openmpi, just in case
export PATH=/...
export LD_LIBRARY_PATH=/...
which mpirun

</code>

** Slurm Location **

<code>

# for now, symbolic link to slurm-21.08.01
export PATH=/...
export LD_LIBRARY_PATH=/...

</code>
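
A quick sanity check that the client tools resolve from this tree; a minimal sketch, assuming the exports above are in place (add them to your ''~/.bashrc'' to make them permanent):

<code>
# confirm the slurm client tools come from the symlinked tree
$ which sbatch sinfo srun
$ sinfo --version
slurm 21.08.1
</code>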

===== Basic Commands =====

<code>

# sorta like bqueues
$ sinfo -l
Thu Oct 14 09:27:02 2021
PARTITION AVAIL TIMELIMIT ...
test*     up ...
mwgpu     up ...
amber128  up ...
exx96     up ...

# more node info
$ sinfo -lN
Thu Oct 14 13:57:12 2021
NODELIST NODES PARTITION ...
n37   1  ...
n37   1  ...
n78   1  amber128 ...
n78   1  ...
n79   1  ...
n79   1  ...

# sorta like bsub
$ sbatch run
Submitted batch job 1000002

# sorta like bjobs
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
...

# sorta like bhosts -l
$ scontrol show node n78
NodeName=n78 Arch=x86_64 CoresPerSocket=8 ...
...

# sorta like bhist -l
$ scontrol show job 1000002
JobId=1000002 JobName=test
...

# sorta like bkill
$ scancel 1000003

</code>
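
A few more everyday commands. These are standard Slurm commands (output columns vary by version, and ''sacct'' requires accounting to be enabled):

<code>
# only your own jobs
$ squeue -u $USER

# history of a finished job, sorta like bhist (needs accounting)
$ sacct -j 1000002 --format=JobID,JobName,State,Elapsed

# interactive shell on a compute node, sorta like an interactive bsub
$ srun --partition=test -n 1 --pty bash
</code>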

===== Documentation =====

  * manual pages for conf files or commands, for example
    * ''man slurm.conf''
    * ''man sbatch''
  * developers documentation web site
    * https://slurm.schedmd.com/
    * https://slurm.schedmd.com/documentation.html
    * https://slurm.schedmd.com/quickstart.html
    * https://slurm.schedmd.com/man_index.html
    * etc...

===== Overview =====

From the information above it is mostly a matter of learning new terminology and how to control devices. As you can see, the gpus are defined as consumable resources (''gres'') on the gpu nodes.

Same on the cpu only compute nodes. Features could be created for memory footprints (for example "HasMem256") or local scratch space (for example "hasLocalscratch1tb").

On the cpu resource requests: you may request 1 or more nodes, 1 or more sockets per node, 1 or more cores (physical) per socket, or 1 or more threads (logical + physical) per core. Such a request can be fine grained or not; just request a number of tasks with ''-n'', or spell the layout out with ''-B S:C:T''.

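A minimal sketch of both styles (''./myprog'' is a placeholder binary):

<code>
# coarse grained: 8 tasks anywhere in the partition
$ srun --partition=test -n 8 --mem=1024 ./myprog

# fine grained: 1 node, 1 socket, 4 physical cores, 2 threads per core
$ srun --partition=test -N 1 -n 8 -B 1:4:2 --mem=1024 ./myprog
</code>
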
//Note: this oversubscribing is not working yet. I can only get 4 simultaneous jobs running. Maybe there is a conflict with Openlava jobs; should isolate a node and do further testing. After isolating n37, 4 jobs with -n 4 exhaust the number of physical cores. Is that why the 5th job goes pending? Solved, see the Changes section.//

===== MPI =====

Slurm has a builtin MPI flavor; I suggest you do not rely on it. The documentation states that on major release upgrades the builtin MPI support may change, which would break jobs that depend on it.

For now, we'll rely on ''PATH''/''LD_LIBRARY_PATH'' settings to point to the desired MPI flavor (see the sample script below).

''srun'' can also launch a job directly, for example:

<code>

$ srun --partition=mwgpu -n 4 -B 1:4:1 --mem=1024 sleep 60 &

</code>

For more details on srun consult https://slurm.schedmd.com/srun.html
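
A hedged sketch of checking and using an MPI flavor by hand (''hello_mpi.c'' is a placeholder, not from this page):

<code>
# confirm the intended OpenMPI is first on PATH
$ which mpicc mpirun

# compile and do a quick local check outside the scheduler
$ mpicc -o hello_mpi hello_mpi.c
$ mpirun -np 4 ./hello_mpi
</code>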

===== Script =====

Putting it all together, a job submission script might look like the example below. Simply submit it to sbatch, assuming the job script name is ''run''.

<code>
$ sbatch run
</code>

** Sample Submit Script **

<code>

#!/bin/bash
# [found at XStream]
# Slurm will IGNORE all lines after the FIRST BLANK LINE,
# even the ones containing #SBATCH.
# Always put your SBATCH parameters at the top of your batch script.
# Took me days to find, [constraint=|gres=] were not working ... silly behavior -Henk
#
# GENERAL
#SBATCH --job-name="test"
#SBATCH --output=out
#SBATCH --error=err
#SBATCH --mail-type=END
#SBATCH --mail-user=username@wesleyan.edu
#
# NODE control
#SBATCH -N 1     # default, nodes
###SBATCH --nodelist=n78
###SBATCH --constraint=hasLocalscratch
###SBATCH --constraint=hasLocalscratch1tb
###SBATCH --exclusive
###SBATCH --oversubscribe
#
# CPU control
#SBATCH -n 8     # total cpus request is tasks=N(S*C*T)
#SBATCH -B 1:4:2 # S:C:T = sockets per node : cores per socket : threads per core
#
# GPU control
###SBATCH --gres=gpu:...   # other gpu types
#SBATCH --gres=gpu:rtx2080s:1

# ENV control
# openmpi
export PATH=/...
export LD_LIBRARY_PATH=/...
which mpirun

# unique job scratch dirs, created (prolog) and removed (epilog)
export MYSANSCRATCH=/...
export MYLOCALSCRATCH=/...
cd $MYLOCALSCRATCH
pwd

# CPU serial job example
date        # look in stdout file
datee       # deliberate typo, look in stderr file
env | grep ^SLURM
echo "hello world of slurm"
touch foo
ls -l foo

# CPU mpi example, note: no -np flag, no --hostfile
mpirun $HOME/...

# GPU docker example, be sure to select rtx2080s gpu
# manual "..."
# cuda 10.2
gpuid="..."
echo ""; echo "docker running on gpu $(hostname -s):$gpuid"
#export CUDA_VISIBLE_DEVICES=$gpuid # or NV_GPU

NV_GPU=$gpuid \
nvidia-docker run --rm -u $(id -u):$(id -g) \
  -v /...:/... \
  -v /...:/... \
  -v /...:/... \
  nvcr.io/... \
  /... \
  --num_gpus=1 --batch_size=64 \
  --model=resnet50 \
  --variable_update=parameter_server > $HOME/.../out.docker

sleep 5m # so you can query job/node with scontrol

</code>
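
While the job sleeps those 5 minutes you can poke at it from the login node; a sketch (job and node ids taken from the output below):

<code>
$ squeue -u $USER
$ scontrol show job 1000056 | egrep 'JobState|NumCPUs|TRES'
$ scontrol show node n79 | egrep 'CPUAlloc|Gres'
</code>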

===== Script Output =====

The relevant sections of the script above should generate output like this.

** err ** file

<code>

# the stderr file starts with
/...

# lots of tensorflow warnings
...

# and that app writes to stderr

----------------------------------------------------------------------
Ran 104 tests in 197.454s

OK (skipped=12)

</code>

** out ** file

<code>

/...
/...
Thu Oct 14 10:36:22 EDT 2021
SLURM_NODELIST=n79
SLURM_JOB_NAME=test
SLURMD_NODENAME=n79
SLURM_TOPOLOGY_ADDR=n79
SLURM_THREADS_PER_CORE=2
SLURM_PRIO_PROCESS=0
SLURM_NODE_ALIASES=(null)
SLURM_GPUS_ON_NODE=4
SLURM_TOPOLOGY_ADDR_PATTERN=node
SLURM_JOB_GPUS=0,1,2,3
SLURM_NNODES=1
SLURM_JOBID=1000056
SLURM_NTASKS=8
SLURM_TASKS_PER_NODE=8
SLURM_WORKING_CLUSTER=slurmcluster:...
SLURM_CONF=/...
SLURM_JOB_ID=1000056
SLURM_JOB_USER=hmeij
SLURM_JOB_UID=8216
SLURM_NODEID=0
SLURM_SUBMIT_DIR=/...
SLURM_TASK_PID=257975
SLURM_NPROCS=8
SLURM_CPUS_ON_NODE=48
SLURM_PROCID=0
SLURM_JOB_NODELIST=n79
SLURM_LOCALID=0
SLURM_JOB_GID=623
SLURM_JOB_CPUS_PER_NODE=48
SLURM_CLUSTER_NAME=slurmcluster
SLURM_GTIDS=0
SLURM_SUBMIT_HOST=greentail52
SLURM_JOB_PARTITION=test
SLURM_JOB_NUM_NODES=1
SLURM_MEM_PER_NODE=192
hello world of slurm
-rw-r--r-- 1 hmeij its 0 Oct 14 10:36 foo
Hello, world, I am 0 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ...)
Hello, world, I am 1 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ...)
Hello, world, I am 4 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ...)
Hello, world, I am 5 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ...)
Hello, world, I am 2 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ...)
Hello, world, I am 3 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ...)
Hello, world, I am 6 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ...)
Hello, world, I am 7 of 8, (Open MPI v4.0.4, package: Open MPI hmeij@greentail52 Distribution, ...)

docker running on gpu n79:3

</code>

and the **out.docker** file

<code>

================
== TensorFlow ==
================

NVIDIA Release 19.09 (build 8044706)
TensorFlow Version 1.14.0

Container image Copyright (c) 2019, NVIDIA CORPORATION.
Copyright 2017-2019 The TensorFlow Authors.

...

Generating training model
Initializing graph
Running warm up
Done warm up

...

----------------------------------------------------------------
Generating training model
Initializing graph
Running warm up
Done warm up
Step  Img/sec  total_loss
1     ...
2     ...
3     ...
4     ...
----------------------------------------------------------------
total images/sec: 378.12
----------------------------------------------------------------

</code>

===== Feedback =====

If there are errors on this page, or misstatements, please let me know.

--- //hmeij//

===== gpu testing =====

  * test standalone slurm v 21.08.1
    * n33-n37 each: 4 gpus, 16 cores, 16 threads, 32 cpus
    * submit one at a time, observe (a submit-loop sketch follows this list)
    * part=test, n 1, B 1:1:1, cuda_visible=0, node specified
      * all on same gpu
    * part=test, n 1, B 1:1:1, cuda_visible not set, no node specified, n33 only
      * all gpus used? nope, all on the same one (gpu 0)
    * redoing above with a ''gres'' gpu request
      * even distribution across all gpus, 17th submit goes pending
    * part=test, n 1, B 1:1:1, cuda_visible not set, no node specified, n[33-34] avail
      * while submitting 34 jobs, one at a time (30s delay), slurm fills up n33 first (all on gpu 0)
      * 17th submit goes to n34, gpu 1 (weird), n33 state=alloc
      * 33rd job ...
      * 34th job ...
      * all n33 and n34 jobs on a single gpu without cuda_visible set
      * how that works with gpu util at 100% with one job is beyond me
      * do all 16 jobs log the same wall time? Yes, between 10.10 and 10.70 hours.

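The submit loop used for these observations might look like this sketch (script name, delay, and squeue format are assumptions):

<code>
# submit 16 single-gpu jobs, one every 30 seconds, then watch placement
for i in $(seq 1 16); do
  sbatch --partition=test -n 1 -B 1:1:1 --gres=gpu:1 run
  sleep 30
done
squeue -o "%.8i %.9P %.8T %.6D %R"
</code>
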
  * ohpc v2.4 slurm v 20.11.8
    * part=test, n 1, B 1:1:1, cuda_visible=0, node specified
    * hit a bug, you must specify cpus-per-gpu **and** mem-per-gpu
      * then slurm detects 4 gpus on allocated node and allows 4 jobs on a single allocated gpu
      * twisted logic
    * so recent openhpc version but old slurm version in software stack
    * trying standalone install on openhpc prod cluster - auth/munge error, no go
    * do all 4 jobs have similar wall time? Yes, on n100 it varies from 0.6 to 0.7 hours

  * ohpc v2.4 slurm v 20.11.8
    * part=test, n 1, B 1:1:1, cuda_visible=0, node specified
    * same as above but all 16 jobs run on gpu 0
    * so the limit of 4 jobs on the rtx5000 gpu is a hardware phenomenon?

===== Changes =====

** OverSubscribe **

A suggestion was made to set ''OverSubscribe'' on the partitions so that more jobs can run per node. Tested with this script:

<code>
#!/bin/bash
#SBATCH --job-name=sleep
#SBATCH --partition=mwgpu
###SBATCH -n 1
#SBATCH -B 1:1:1
#SBATCH --mem=1024
sleep 60
</code>
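
To verify, submit several copies and see how many run at once (a sketch; assumes the script above is saved as ''sleep.sh''):

<code>
# with oversubscribe working, more than 4 of these should run simultaneously
for i in $(seq 1 8); do sbatch sleep.sh; done
squeue -u $USER
</code>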

--- //hmeij//

** GPU-CPU cores **

Noticed this with debug level on in slurmd.log. No action taken.

<code>

# n37: old gpu model, bound to all physical cpu cores
GRES[gpu] Type:...
...

# n78: somewhat dated gpu model, bound to top/bot of physical cores (16)
GRES[gpu] Type:...
...

# n79: more recent gpu model, same bound pattern of top/bot (24)
GRES[gpu] Type:...
...

</code>

** Partition Priority **

If priorities are set on the partitions you can list more than one queue per job...

<code>
srun --partition=exx96,amber128,mwgpu ...
</code>

The above will fill up n79 first, then n78, then n36...
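
In ''slurm.conf'' this corresponds to the ''PriorityTier'' partition parameter; a hedged sketch (node lists and values are illustrative, not this cluster's actual config):

<code>
# hypothetical slurm.conf fragment: jobs go to the highest PriorityTier first
PartitionName=exx96    Nodes=n79      PriorityTier=30 State=UP
PartitionName=amber128 Nodes=n78      PriorityTier=20 State=UP
PartitionName=mwgpu    Nodes=n[33-37] PriorityTier=10 State=UP
</code>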

** Node Weight Priority **

Weight nodes by the memory per logical core: jobs will be allocated the nodes with the lowest weight that satisfies their requirements. So cpu jobs will be routed last to the gpu queues because those carry the highest weight (= lowest priority).

<code>
hp12:      12/8  = 1.5
tinymem:   32/20 = 1.6
mw128:    128/24 = 5.333333
mw256:    256/16 = 16

exx96:     96/24 = 4
amber128: 128/16 = 8
mwgpu:    256/16 = 16
</code>
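
In ''slurm.conf'' this maps to the ''Weight'' parameter on the node definitions (lowest weight is allocated first); a hedged sketch with illustrative integer weights and assumed node names:

<code>
# hypothetical slurm.conf fragment; "..." stands for the usual CPUs/RealMemory/Gres specs
NodeName=n[90-97] ... Weight=2    # tinymem-style node: cpu jobs land here first
NodeName=n79      ... Weight=4    # exx96 gpu node: picked later
NodeName=n[33-37] ... Weight=16   # mwgpu: picked last
</code>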

Or more arbitrarily, based on desired cpu-node consumption by cpu jobs. No action taken.

<code>
tinymem   ...
mw128     20
mw256fd   ...
mwgpu     40 + HasMem256 feature
amber128  ...
exx96     80
</code>

** CR_CPU_Memory **

Makes for a better 1-1 relationship of physical core to consumable ''CPU''...

Deployed. May need to set threads=1 and cpus=(quantity of physical cores) ... this went horribly wrong: it resulted in a sockets=1 and threads=1 setting for each node.
--- //hmeij//

We did set the number of cpus per gpu (12 for n79) and minimum memory settings. Now we see the 5th job go pending with 48 cpus consumed. When using sbatch set -n 8, because sbatch will override the defaults.

<code>
srun --partition=test ...
</code>
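
For reference, a hedged sketch of the relevant ''slurm.conf'' lines (the parameter names are real Slurm options, but the values here are illustrative, not this cluster's actual config):

<code>
# hypothetical slurm.conf fragment
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory
# per-partition defaults, e.g. for the gpu partition
PartitionName=exx96 Nodes=n79 DefCpuPerGPU=12 DefMemPerGPU=1024 State=UP
</code>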
\\
**[[cluster:...]]**