cluster:208 [2022/06/01 18:28] hmeij07 [gpu testing]
cluster:208 [2022/11/02 17:28] (current) hmeij07 [gpu testing]
  * same as above but all 16 jobs run on gpu 0
  * so the limit of 4 jobs on the rtx5000 gpu is a hardware phenomenon?
  * all 16 jobs finished, wall times of 3.11 to 3.60 hours
+ | |||
+ | ===== gpu testing 2 ===== | ||
+ | |||
+ | Newer 2022 version seems to have reversed the override options for oversubscribe. So here is our testing...back to CR_CPU_Memory and OverSubscribe=No | ||
+ | |||
<code>
+ | |||
+ | CR_Socket_Memory | ||
+ | PartitionName=test Nodes=n[100-101] | ||
+ | Default=YES MaxTime=INFINITE State=UP | ||
+ | OverSubscribe=No DefCpuPerGPU=12 | ||
+ | |||
+ | MPI jobs with -N 1, -n 8 and -B 2:4:1 | ||
+ | no override options, cpus=48 | ||
+ | --mem=2048, cpus=48 | ||
+ | and --cpus-per-task=1, | ||
+ | and --ntasks-per-node=8, | ||
+ | |||
+ | MPI jobs with -N, -n 8 and -B 1:8:1 | ||
+ | --mem=10240 cpus=48 | ||
+ | and --cpus-per-task=1, | ||
+ | and --ntasks-per-node=8, | ||
+ | |||
+ | GPU jobs with -N 1, -n 1 and -B 1:1:1 | ||
+ | no override options, no cuda export, cpus=48 | ||
+ | --cpus-per-gpu=1, | ||
+ | and --mem-per-gpu=7168, | ||
+ | while other gpu runs in queue but gpus are free???) | ||
+ | |||
+ | GPU jobs with -N 1, -n 1 and -B 1:1:1 | ||
+ | no override options, yes cuda export, cpus=48 | ||
+ | --cpus-per-gpu=1, | ||
+ | and --mem-per-gpu=7168, | ||
+ | while a gpu job runs, gpus are free, then it executes) | ||
+ | |||
+ | ...suddenly the cpus=1 turns into cpus=24 | ||
+ | when submitting, slurm confused becuase of all | ||
+ | the job cancellations? | ||
+ | |||
+ | CR_CPU_Memory test=no, mwgpu=force: | ||
+ | PartitionName=test Nodes=n[100-101] | ||
+ | Default=YES MaxTime=INFINITE State=UP | ||
+ | OverSubscribe=No DefCpuPerGPU=12 | ||
+ | |||
+ | MPI jobs with -N 1, -n 8 and -B 2:4:1 | ||
+ | no override options, cpus=8 (queue fills across nodes, | ||
+ | but only one job per node, test & mwgpu) | ||
+ | --mem=1024, cpus=8 (queue fills first node ..., | ||
+ | but only three jobs per node, test 3x8=24 full 4th job pending & | ||
+ | mwgpu 17th job goes pending on n33, overloaded with -n 8 !!) | ||
+ | (not needed) --cpus-per-task=?, | ||
+ | (not needed) | ||
+ | |||
+ | |||
+ | GPU jobs with -N 1, -n 1 and -B 1:1:1 on test | ||
+ | no override options, no cuda export, cpus=12 (one gpu per node) | ||
+ | --cpus-per-gpu=1, | ||
+ | and --mem-per-gpu=7168, | ||
+ | required else all mem allocated!, max 4 jobs per node, | ||
+ | fills first node first...cuda export not needed) | ||
+ | with cuda export, same node, same gpu, | ||
+ | with " | ||
+ | |||
+ | |||
+ | GPU jobs with -N 1, -n 1 and -B 1:1:1 on mwgpu | ||
+ | --cpus-per-gpu=1, | ||
+ | and --mem-per-gpu=7168, | ||
+ | (same node, same gpu, cuda export set, | ||
+ | with " | ||
+ | potential for overloading!) | ||
+ | |||
</code>
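
The CR_CPU_Memory MPI result above (cpus=8 with only a --mem override) can be sketched as a submit script. This is a sketch only: the script body and binary name (my_mpi_app) are assumptions; the Slurm options are the ones tested in the notes.

<code bash>
#!/bin/bash
# Sketch of the CR_CPU_Memory MPI submission tested above;
# my_mpi_app is a hypothetical binary name.
#SBATCH --partition=test
#SBATCH -N 1              # one node
#SBATCH -n 8              # eight tasks
#SBATCH -B 2:4:1          # sockets:cores-per-socket:threads-per-core
#SBATCH --mem=1024        # without a memory request, all node memory is allocated

# under CR_CPU_Memory this yields cpus=8; --cpus-per-task and
# --ntasks-per-node overrides were not needed
srun ./my_mpi_app
</code>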
+ | |||