==== cuda toolkit ====
  
Upgrading CUDA to the latest drivers and toolkit that still support our oldest gpu model, the K20m gpus found in nodes n33-n37 (queue mwgpu). Consult the page on the previous K20m upgrade, [[cluster:172|K20 Redo]].
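Before and after the upgrade it is worth recording what a node actually runs. A minimal check, assuming passwordless ssh to the node and the usual /usr/local/cuda link (these commands are not taken from the install log):

<code>
# driver version and gpu model on one of the mwgpu nodes
ssh n33 nvidia-smi --query-gpu=name,driver_version --format=csv

# toolkit version behind the /usr/local/cuda link
ssh n33 /usr/local/cuda/bin/nvcc --version
</code>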
  
For legacy hardware find the latest legacy driver here
  
# update link to this version: yes
# no -silent -driver ...
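# (sketch, not from this install) the runfile installer also has a non-interactive
# mode, roughly: sh cuda_<version>_linux.run --silent --toolkit
# here the interactive prompts were answered instead (see above)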
  
===========
  
  
# no nvidia_modprobe? ls -l /dev/nvidia?
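# note: the /dev/nvidia* device nodes are only created once the driver is used,
# e.g. after running nvidia-smi once (or via the nvidia-modprobe helper)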
  
reboot
  
  
==== Testing ====
  
Script ~hmeij/slurm/run.centos, cuda 11.2, pmemd.cuda of a local install of amber20, with (a sketch of such a script follows the list)
  * #SBATCH --mem-per-gpu=7168
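A minimal sketch of what such a run.centos-style job script might look like; the partition, node name and --mem-per-gpu value are from this page, while the Amber input files, the environment setup and the remaining #SBATCH values are assumptions:

<code>
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=mwgpu
#SBATCH --nodelist=n33
#SBATCH --ntasks=1
#SBATCH --gres=gpu:tesla_k20m:1
#SBATCH --mem-per-gpu=7168

# local amber20 + cuda 11.2 environment (paths assumed)
export CUDA_HOME=/usr/local/cuda
source ~/amber20/amber.sh

pmemd.cuda -O -i mdin -p prmtop -c inpcrd -o mdout -x mdcrd -r restrt
</code>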
  
For some reason this yields cpus=8, which is different behavior (cpus=1 expected). Slurm is overriding the above settings with the partition setting DefCpuPerGPU=8. Slurm has not changed but the cuda version has. Odd. The good news is that Amber runs fine, no need to recompile.
  
<code>

# from slurmd.log
[2023-09-05T14:51:00.691] Gres Name=gpu Type=tesla_k20m Count=4
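# i.e. slurmd sees four tesla_k20m gpus on the node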
  
JOBID   PARTITION         NAME          USER  ST          TIME NODES  CPUS    MIN_MEMORY NODELIST(REASON)
1053052 mwgpu             test         hmeij            0:09                                  n33
  
[hmeij@cottontail2 slurm]$ ssh n33 gpu-info
2, Tesla K20m, 25, 0 MiB, 4743 MiB, 0 %, 0 %
3, Tesla K20m, 26, 0 MiB, 4743 MiB, 0 %, 0 %

[hmeij@cottontail2 slurm]$ ssh n33 gpu-process
gpu_name, gpu_id, pid, process_name
  
</code>

  * #SBATCH --cpus-per-gpu=1

Adding this does force Slurm to allocate just a single cpu. Now try 4 gpu jobs per node. There is no need to set CUDA_VISIBLE_DEVICES.

<code>

JOBID   PARTITION         NAME          USER  ST          TIME NODES  CPUS    MIN_MEMORY NODELIST(REASON)
1053992 mwgpu             test         hmeij            0:04                                  n33

[hmeij@cottontail2 slurm]$ for i in `seq 1 6`; do sbatch run.centos; sleep 30; squeue | grep hmeij; done

# output
Submitted batch job 1054000
1054000 mwgpu             test         hmeij            0:30                                  n33
Submitted batch job 1054001
1054001 mwgpu             test         hmeij            0:30                                  n33
1054000 mwgpu             test         hmeij            1:00                                  n33
Submitted batch job 1054002
1054002 mwgpu             test         hmeij            0:30                                  n33
1054001 mwgpu             test         hmeij            1:00                                  n33
1054000 mwgpu             test         hmeij            1:30                                  n33
Submitted batch job 1054003
1054003 mwgpu             test         hmeij            0:30                                  n33
1054002 mwgpu             test         hmeij            1:00                                  n33
1054001 mwgpu             test         hmeij            1:30                                  n33
1054000 mwgpu             test         hmeij            2:00                                  n33
Submitted batch job 1054004
1054004 mwgpu             test         hmeij  PD          0:00                          (Resources)
1054003 mwgpu             test         hmeij            1:00                                  n33
1054002 mwgpu             test         hmeij            1:30                                  n33
1054001 mwgpu             test         hmeij            2:00                                  n33
1054000 mwgpu             test         hmeij            2:30                                  n33
Submitted batch job 1054005
1054005 mwgpu             test         hmeij  PD          0:00                     0(Nodes required f
1054004 mwgpu             test         hmeij  PD          0:00                          (Resources)
1054003 mwgpu             test         hmeij            1:30                                  n33
1054002 mwgpu             test         hmeij            2:00                                  n33
1054001 mwgpu             test         hmeij            2:30                                  n33
1054000 mwgpu             test         hmeij            3:00                                  n33


[hmeij@cottontail2 slurm]$ ssh n33 gpu-info
id,name,temp.gpu,mem.used,mem.free,util.gpu,util.mem
0, Tesla K20m, 40, 95 MiB, 4648 MiB, 100 %, 25 %
1, Tesla K20m, 40, 95 MiB, 4648 MiB, 94 %, 23 %
2, Tesla K20m, 35, 95 MiB, 4648 MiB, 93 %, 21 %
3, Tesla K20m, 28, 95 MiB, 4648 MiB, 97 %, 25 %
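# all four K20m gpus are busy, roughly one job per gpu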

</code>
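To double check that the cpus=8 default really comes from the partition and not from the new cuda install, the partition job defaults can be inspected directly. A minimal check; the slurm.conf path is an assumption:

<code>
# partition-level defaults (look for DefCpuPerGPU)
scontrol show partition mwgpu | grep -i defcpupergpu

# or straight from the config file
grep -i DefCpuPerGPU /usr/local/slurm/etc/slurm.conf
</code>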

Other software does need to be recompiled because it links against specific versions of libraries rather than the generic libName.so (lammps).

Script ~hmeij/slurm/run.centos.lammps sets up the environment and prints the lmp help page.

<code>

/share/apps/intel/parallel_studio_xe_2016_update3/compilers_and_libraries_2016.3.210/linux/bin/intel64/ifort
/usr/local/cuda/bin/nvcc
/share/apps/CENTOS7/openmpi/4.0.4/bin/mpirun
/share/apps/CENTOS7/python/3.8.3/bin/python
/share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/lmp_mpi-cuda-single-single
        linux-vdso.so.1 =>  (0x00007ffd714ec000)
        libjpeg.so.62 => /lib64/libjpeg.so.62 (0x00007fe443b9a000)
        libcudart.so.11.0 => /usr/local/cuda/lib64/libcudart.so.11.0 (0x00007fe44390b000)
        libcuda.so.1 => /lib64/libcuda.so.1 (0x00007fe442223000)
        libcufft.so.10 => /usr/local/cuda/lib64/libcufft.so.10 (0x00007fe436a74000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fe436870000)
        libmpi.so.40 => /share/apps/CENTOS7/openmpi/4.0.4/lib/libmpi.so.40 (0x00007fe43655b000)
        libstdc++.so.6 => /share/apps/CENTOS7/gcc/6.5.0/lib64/libstdc++.so.6 (0x00007fe4361d9000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fe435ed7000)
        libgcc_s.so.1 => /share/apps/CENTOS7/gcc/6.5.0/lib64/libgcc_s.so.1 (0x00007fe435cc0000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fe435aa4000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fe4356d6000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fe443def000)
        librt.so.1 => /lib64/librt.so.1 (0x00007fe4354ce000)
        libopen-rte.so.40 => /share/apps/CENTOS7/openmpi/4.0.4/lib/libopen-rte.so.40 (0x00007fe435218000)
        libopen-pal.so.40 => /share/apps/CENTOS7/openmpi/4.0.4/lib/libopen-pal.so.40 (0x00007fe434f09000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00007fe434d06000)
        libz.so.1 => /lib64/libz.so.1 (0x00007fe434af0000)

Large-scale Atomic/Molecular Massively Parallel Simulator - 28 Mar 2023 - Development

Usage example: lmp_mpi-cuda-single-single -var t 300 -echo screen -in in.alloy

List of command line options supported by this LAMMPS executable:
<snip>

# hmmm, using -suffix gpu it does not jump on the gpus, generic non-gpu libthread error
# same version rocky8/cuda-11.6 works, centos7/cuda-10.2 works, all "make" compiles
# try "cmake" compile on n33-n36
# libspace tarball download fails on file hash and
# yields a status: [1;"Unsupported protocol" error for ML-PACE

# without ML-PACE the hash fails for the opencl-loader third party download, bad url
# https://download.lammps.org/thirdparty/opencl-loader-opencl-loadewer-version...tgz
# then extract in _deps/ dir
# and added -D GPU_LIBRARY=../lib/gpu/libgpu.a a la QUIP_LIBRARY
# that works, the cmake-compiled binary jumps on multiple gpus


[hmeij@n35 sharptail]$ mpirun -n 2 \
/share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/cmake/single-single/lmp \
-suffix gpu -in in.colloid

[root@greentail52 ~]# ssh n35 gpu-process
gpu_name, gpu_id, pid, process_name
Tesla K20m, 0, 9911, /share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/cmake/single-single/lmp
Tesla K20m, 1, 9912, /share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/cmake/single-single/lmp

# some stats, colloid example

1 cpu, 1 gpu
Total wall time: 0:05:49
2 cpus, 2 gpus
Total wall time: 0:03:58
4 cpus, 4 gpus
Total wall time: 0:02:23
8 cpus, 4 gpus
Total wall time: 0:02:23

# but the ML-PACE hash error is different, so no go there

</code>
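For reference, a rough sketch of the cmake configure step described in the comments above; the build directory layout, the sm_35 arch for the K20m and the single_single precision are assumptions, only -D GPU_LIBRARY and the install prefix come from the notes:

<code>
# hedged sketch of the LAMMPS cmake GPU build (values assumed, see note above)
cd lammps-25Apr2023; mkdir build; cd build
cmake ../cmake \
  -D PKG_GPU=on -D GPU_API=cuda -D GPU_ARCH=sm_35 -D GPU_PREC=single_single \
  -D GPU_LIBRARY=../lib/gpu/libgpu.a \
  -D BUILD_MPI=yes \
  -D CMAKE_INSTALL_PREFIX=/share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/cmake/single-single
make -j 8 && make install
</code>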
  
\\
**[[cluster:0|Back]]**