cluster:208 [DokuWiki]

Differences

This shows you the differences between two versions of the page.

cluster:208 [2021/10/18 14:55]
hmeij07 [Changes]
cluster:208 [2022/06/03 08:30]
hmeij07 [gpu testing]
Line 382: Line 382:
  
  --- //[[hmeij@wesleyan.edu|Henk]] 2021/10/15 09:16//
 +
 +===== gpu testing =====
 +
 +  * test standalone slurm v 21.08.1
 +  * n33-n37 each: 4 gpus, 16 cores, 16 threads, 32 cpus
 +  * submit one at a time, observe  
 +  * part=test, n 1, B 1:1:1, cuda_visible=0, no node specified, n33 only (see the srun sketch after this list)
 +  * "resources" reason at 17th submit, used up 16 cores and 16 threads
 +  * all on same gpu
 +  * part=test, n 1, B 1:1:1, cuda_visible not set, no node specified, n33 only
 +  * "resources" reason at 17th submit too, same reason
 +  * all gpus used? nope, all on the same one 0
 +  * redoing above with a  ''export CUDA_VISIBLE_DEVICES=`shuf -i 0-3 -n 1`''
 +  * even distribution across all gpus, 17th submit reason too
 +  * part=test, n 1, B 1:1:1, cuda_visible not set, no node specified, n[33-34] avail
 +  * while submitting 34 jobs, one at a time (30s delay), slurm fills up n33 first (all on gpu 0)
 +  * 17th submit goes to n34, gpu 1 (weird), n33 state=alloc, n34 state=mix
 +  * 33rd job, "Resources" reason, job pending
 +  * 34th job, "Priority" reason (?), job pending
 +  * all n33 and n34 jobs on a single gpu without cuda_visible set
 +  * how that works with gpu util at 100% with one job is beyond me
 +  * do all 16 jobs log the same wall time? Yes, between 10.10 and 10.70 hours.
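 +
 +A minimal sketch of the kind of submission used in the tests above, reconstructed from the shorthand (part=test, n 1, B 1:1:1); the job script name is a placeholder, not the actual workload:
 +
 +<code>
 +# pin all submissions to gpu 0, as in the first test
 +export CUDA_VISIBLE_DEVICES=0
 +srun --partition=test -n 1 -B 1:1:1 ./gpu_job.sh &
 +
 +# variant from the bullets: pick a random gpu out of 0-3 for each submission
 +export CUDA_VISIBLE_DEVICES=`shuf -i 0-3 -n 1`
 +srun --partition=test -n 1 -B 1:1:1 ./gpu_job.sh &
 +</code>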
 +
 +  * ohpc v2.4 slurm v 20.11.8 
 +  * part=test, n 1, B 1:1:1, cuda_visible=0, no node specified, n100 only
 +  * hit a bug, you must specify cpus-per-gpu **and** mem-per-gpu (see the srun sketch after this list)
 +  * then slurm detects 4 gpus on the allocated node and allows 4 jobs on a single allocated gpu
 +  * twisted logic
 +  * so a recent openhpc version but an old slurm version in the software stack
 +  * trying standalone install on openhpc prod cluster - auth/munge error, no go
 +  * do all 4 jobs have similar wall times? Yes, on n100 they vary from 0.6 to 0.7 hours.
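 +
 +A sketch of the workaround for the bug noted above: request the gpu with both per-gpu options spelled out. The values are placeholders for illustration, not taken from the notes:
 +
 +<code>
 +# slurm 20.11.8 here wanted both --cpus-per-gpu and --mem-per-gpu
 +srun --partition=test --gres=gpu:1 --cpus-per-gpu=1 --mem-per-gpu=1024 sleep 60 &
 +</code>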
 +
 +  * ohpc v2.4 slurm v 20.11.8 
 +  * part=test, n 1, B 1:1:1, cuda_visible=0, no node specified, n78 only
 +  * same as above but all 16 jobs run on gpu 0
 +  * so the limit of 4 jobs on the rtx5000 gpu is a hardware phenomenon? (see the check sketch after this list)
 +  * all 16 jobs finished, wall times of 3.11 to 3.60 hours
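 +
 +A quick check of where the jobs actually land, run on the compute node itself (n78 or n100); a minimal sketch assuming a standard nvidia driver install:
 +
 +<code>
 +# poll per-gpu utilization and memory to confirm all running jobs share device 0
 +watch -n 60 nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
 +</code>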
 +
 +
 +
 +
  
 ===== Changes =====
Line 430: Line 471:
 </code>
  
-** Weight Priority **
+** Partition Priority **
  
-Weight nodes by the memory per logical core: jobs will be allocated the nodes with the lowest weight which satisfies their requirements. So CPU jobs will be routed last to gpu queues because they have the highest weight (=lowest priority).
+If partition priorities are set you can list more than one queue in a single submission...
  
 +<code>
 + srun --partition=exx96,amber128,mwgpu  --mem=1024  --gpus=1  --gres=gpu:any sleep 60 &
 +</code>
 +
 +The above will fill up n79 first, then n78, then n36...
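 +
 +One way that fill order could be expressed is with per-partition ''PriorityTier'' values in slurm.conf; a minimal sketch, with tier values and node lists as placeholders rather than the production settings, and with the assumption that partition priority (not node weight) is what drives the order here:
 +
 +<code>
 +# higher PriorityTier is preferred when a job lists several partitions
 +# (node lists and tier values below are placeholders)
 +PartitionName=exx96    Nodes=n[79-90] PriorityTier=30 State=UP
 +PartitionName=amber128 Nodes=n78      PriorityTier=20 State=UP
 +PartitionName=mwgpu    Nodes=n[33-37] PriorityTier=10 State=UP
 +</code>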
 +
 +** Node Weight Priority **
 +
 +Weight nodes by the memory per logical core: jobs will be allocated the nodes with the lowest weight which satisfies their requirements. So CPU jobs will be routed last to gpu queues because they have the highest weight (=lowest priority).
 <code>
 hp12: 12/8 = 1.5
Line 460: Line 510:
 Makes for a better 1-to-1 relationship of physical core to ''ntask''; the "hyperthreads" are still available to user jobs, but physical cores are consumed first, if I got all this right.
  
-Deployed. May need to set threads=1 and cpus=(quantity of physical cores)
+Deployed. May need to set threads=1 and cpus=(quantity of physical cores)... this went horribly wrong; it resulted in a sockets=1 setting and threads=1 for each node.
  --- //[[hmeij@wesleyan.edu|Henk]] 2021/10/18 14:32//
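 +
 +For reference, a minimal sketch of a node definition with the topology spelled out instead of a bare cpu count; the socket/core/thread split is inferred from the "16 cores, 16 threads, 32 cpus" note above, and RealMemory/Weight are placeholders, not the production values:
 +
 +<code>
 +# explicit topology: 2 sockets x 8 cores x 2 threads = 32 logical cpus
 +NodeName=n33 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=192000 Gres=gpu:4 Weight=10 State=UNKNOWN
 +</code>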
 +
 +We did set the number of cpus per gpu (12 for n79) and minimum memory settings. Now the 5th job goes pending with 48 cpus consumed (4 jobs x 12 cpus per gpu). When using sbatch, set -n 8 because sbatch will override these defaults.
 +
 +<code>
 + srun --partition=test  --mem=1024  --gres=gpu:geforce_rtx_2080_s:1 sleep 60 &
 +</code>
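 +
 +The sbatch equivalent hinted at above, as a minimal sketch: same partition and gres string as the srun line, with ''-n 8'' set explicitly since sbatch overrides the per-gpu defaults; the payload is a placeholder:
 +
 +<code>
 +#!/bin/bash
 +#SBATCH --partition=test
 +#SBATCH -n 8
 +#SBATCH --mem=1024
 +#SBATCH --gres=gpu:geforce_rtx_2080_s:1
 +# placeholder workload
 +sleep 60
 +</code>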
  
 \\
 **[[cluster:0|Back]]**
  
cluster/208.txt · Last modified: 2022/11/02 13:28 by hmeij07