Upgrading CUDA to the latest driver and toolkit that still support our oldest GPU model, the K20m GPUs found in nodes n33-n37 (queue mwgpu). Consult the page on the previous K20m upgrade, K20 Redo.
For legacy hardware, find the latest legacy driver here.
Then download the selected driver series here.
Then download the latest toolkit supported for the K20m on CentOS 7 (11.2).
# install drivers (uninstalls existing drivers, accept defaults)
cd /usr/local/src
BASE_URL=https://us.download.nvidia.com/tesla   # tesla driver download area, per nvidia's install docs
DRIVER_VERSION=470.199.02
curl -fSsl -O $BASE_URL/$DRIVER_VERSION/NVIDIA-Linux-x86_64-$DRIVER_VERSION.run
sh ./NVIDIA-Linux-x86_64-470.199.02.run

# install toolkit
wget https://developer.download.nvidia.com/compute/cuda/11.2.1/local_installers/cuda_11.2.1_460.32.03_linux.run
sh cuda_11.2.1_460.32.03_linux.run

# prompts
│ CUDA Installer
│ - [X] Driver
│      [X] 460.32.03
│ + [X] CUDA Toolkit 11.2
│   [ ] CUDA Samples 11.2
│   [ ] CUDA Demo Suite 11.2
│   [ ] CUDA Documentation 11.2
│   Options
│   Install

# update link to this version: yes
# no -silent -driver ...

===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-11.2/
Samples:  Not Selected

Please make sure that
 - PATH includes /usr/local/cuda-11.2/bin
 - LD_LIBRARY_PATH includes /usr/local/cuda-11.2/lib64, or,
   add /usr/local/cuda-11.2/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-11.2/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Logfile is /var/log/cuda-installer.log

# no nvidia_modprobe?
ls -l /dev/nvidia?
reboot

# then
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

[hmeij@n33 ~]$ nvidia-smi
Tue Sep  5 14:43:15 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K20m          On   | 00000000:02:00.0 Off |                    0 |
| N/A   26C    P8    25W / 225W |      0MiB /  4743MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m          On   | 00000000:03:00.0 Off |                    0 |
| N/A   27C    P8    26W / 225W |      0MiB /  4743MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20m          On   | 00000000:83:00.0 Off |                    0 |
| N/A   25C    P8    24W / 225W |      0MiB /  4743MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K20m          On   | 00000000:84:00.0 Off |                    0 |
| N/A   26C    P8    25W / 225W |      0MiB /  4743MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

# startup slurm, finds gpus? yes
# old compiled code compatible? test
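To answer those last two questions: a quick way to confirm Slurm found the GPUs after the reboot (a sketch; the slurmd log location is an assumption):

# on the node: restart slurmd, then confirm it enumerated the K20m gres
systemctl restart slurmd
grep 'Gres Name=gpu' /var/log/slurmd.log     # log path assumed
# from the head node: confirm slurm advertises the gpus
scontrol show node n33 | grep -i gres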
Script ~hmeij/slurm/run.centos runs pmemd.cuda from a local Amber20 install against CUDA 11.2 with settings along the lines sketched below.
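The script itself is not reproduced here; this is a minimal sketch of what it presumably requests (directives assumed, not copied from run.centos):

#!/bin/bash
#SBATCH --job-name=test              # appears as NAME=test in squeue
#SBATCH --partition=mwgpu            # the K20m queue, nodes n33-n37
#SBATCH -n 1                         # one task, so CPUS=1 expected
#SBATCH --gres=gpu:tesla_k20m:1      # one gpu, matches the gres type in slurmd.log

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
pmemd.cuda -O -i mdin ...            # local amber20 install; remaining args elided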
For some reason this yields CPUS=8, which is unexpected behavior (CPUS=1 was expected): Slurm overrides the script's settings with the partition setting DefCpuPerGPU=8. Slurm has not changed, but the CUDA version has. Odd. The good news is that Amber runs fine, so there is no need to recompile.
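For context, this is how such a default would look in slurm.conf (a sketch, not the actual partition definition):

# slurm.conf sketch: DefCpuPerGPU silently expands a 1-gpu job to 8 cpus
# unless the job states its cpu count explicitly
PartitionName=mwgpu Nodes=n[33-37] DefCpuPerGPU=8 State=UP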
# from slurmd.log
[2023-09-05T14:51:00.691] Gres Name=gpu Type=tesla_k20m Count=4

JOBID PARTITION  NAME   USER ST  TIME  NODES CPUS MIN_MEMORY NODELIST(REASON)
1053052  mwgpu   test  hmeij  R  0:09      1    8          0 n33

[hmeij@cottontail2 slurm]$ ssh n33 gpu-info
id,name,temp.gpu,mem.used,mem.free,util.gpu,util.mem
0, Tesla K20m, 36, 95 MiB, 4648 MiB, 100 %, 25 %
1, Tesla K20m, 26, 0 MiB, 4743 MiB, 0 %, 0 %
2, Tesla K20m, 25, 0 MiB, 4743 MiB, 0 %, 0 %
3, Tesla K20m, 26, 0 MiB, 4743 MiB, 0 %, 0 %

[hmeij@cottontail2 slurm]$ ssh n33 gpu-process
gpu_name, gpu_id, pid, process_name
Tesla K20m, 0, 28394, pmemd.cuda
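The line added to run.centos is not shown in these notes; presumably it was an explicit per-GPU CPU request, since an explicit request takes precedence over the partition's DefCpuPerGPU. A sketch, assuming --cpus-per-gpu was the directive used:

#SBATCH --cpus-per-gpu=1     # explicit per-gpu cpu count overrides DefCpuPerGPU=8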
Adding this does force Slurm to allocate just a single CPU. Now try four GPU jobs per node. No CUDA_VISIBLE_DEVICES setting is needed.
JOBID PARTITION  NAME   USER ST  TIME  NODES CPUS MIN_MEMORY NODELIST(REASON)
1053992  mwgpu   test  hmeij  R  0:04      1    1          0 n33

[hmeij@cottontail2 slurm]$ for i in `seq 1 6`; do sbatch run.centos; sleep 30; squeue | grep hmeij; done

# output
Submitted batch job 1054000
1054000  mwgpu   test  hmeij  R  0:30      1    1          0 n33
Submitted batch job 1054001
1054001  mwgpu   test  hmeij  R  0:30      1    1          0 n33
1054000  mwgpu   test  hmeij  R  1:00      1    1          0 n33
Submitted batch job 1054002
1054002  mwgpu   test  hmeij  R  0:30      1    1          0 n33
1054001  mwgpu   test  hmeij  R  1:00      1    1          0 n33
1054000  mwgpu   test  hmeij  R  1:30      1    1          0 n33
Submitted batch job 1054003
1054003  mwgpu   test  hmeij  R  0:30      1    1          0 n33
1054002  mwgpu   test  hmeij  R  1:00      1    1          0 n33
1054001  mwgpu   test  hmeij  R  1:30      1    1          0 n33
1054000  mwgpu   test  hmeij  R  2:00      1    1          0 n33
Submitted batch job 1054004
1054004  mwgpu   test  hmeij PD  0:00      1    1          0 (Resources)
1054003  mwgpu   test  hmeij  R  1:00      1    1          0 n33
1054002  mwgpu   test  hmeij  R  1:30      1    1          0 n33
1054001  mwgpu   test  hmeij  R  2:00      1    1          0 n33
1054000  mwgpu   test  hmeij  R  2:30      1    1          0 n33
Submitted batch job 1054005
1054005  mwgpu   test  hmeij PD  0:00      1    1          0 (Nodes required f
1054004  mwgpu   test  hmeij PD  0:00      1    1          0 (Resources)
1054003  mwgpu   test  hmeij  R  1:30      1    1          0 n33
1054002  mwgpu   test  hmeij  R  2:00      1    1          0 n33
1054001  mwgpu   test  hmeij  R  2:30      1    1          0 n33
1054000  mwgpu   test  hmeij  R  3:00      1    1          0 n33

[hmeij@cottontail2 slurm]$ ssh n33 gpu-info
id,name,temp.gpu,mem.used,mem.free,util.gpu,util.mem
0, Tesla K20m, 40, 95 MiB, 4648 MiB, 100 %, 25 %
1, Tesla K20m, 40, 95 MiB, 4648 MiB, 94 %, 23 %
2, Tesla K20m, 35, 95 MiB, 4648 MiB, 93 %, 21 %
3, Tesla K20m, 28, 95 MiB, 4648 MiB, 97 %, 25 %
Other software does need to be recompiled because it links against specific versions of libraries rather than the generic libName.so (LAMMPS, for example).
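A quick way to spot such versioned dependencies (the binary path is taken from the listing below):

# versioned sonames like libcudart.so.11.0 force a rebuild when cuda changes;
# "not found" lines flag libraries that no longer resolve
ldd /share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/lmp_mpi-cuda-single-single \
  | grep -E 'cuda|cufft|not found'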
Script ~hmeij/slurm/run.centos.lammps sets up the environment and requests the help page.
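A sketch of that environment setup, assuming plain exports rather than modules (paths taken from the toolchain listing below):

export PATH=/usr/local/cuda/bin:/share/apps/CENTOS7/openmpi/4.0.4/bin:/share/apps/CENTOS7/python/3.8.3/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/share/apps/CENTOS7/openmpi/4.0.4/lib:/share/apps/CENTOS7/gcc/6.5.0/lib64:$LD_LIBRARY_PATH
# sanity check the toolchain, then ask the lammps binary for its help page
which nvcc mpirun python
/share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/lmp_mpi-cuda-single-single -h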
/share/apps/intel/parallel_studio_xe_2016_update3/compilers_and_libraries_2016.3.210/linux/bin/intel64/ifort
/usr/local/cuda/bin/nvcc
/share/apps/CENTOS7/openmpi/4.0.4/bin/mpirun
/share/apps/CENTOS7/python/3.8.3/bin/python

/share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/lmp_mpi-cuda-single-single

linux-vdso.so.1 => (0x00007ffd714ec000)
libjpeg.so.62 => /lib64/libjpeg.so.62 (0x00007fe443b9a000)
libcudart.so.11.0 => /usr/local/cuda/lib64/libcudart.so.11.0 (0x00007fe44390b000)
libcuda.so.1 => /lib64/libcuda.so.1 (0x00007fe442223000)
libcufft.so.10 => /usr/local/cuda/lib64/libcufft.so.10 (0x00007fe436a74000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007fe436870000)
libmpi.so.40 => /share/apps/CENTOS7/openmpi/4.0.4/lib/libmpi.so.40 (0x00007fe43655b000)
libstdc++.so.6 => /share/apps/CENTOS7/gcc/6.5.0/lib64/libstdc++.so.6 (0x00007fe4361d9000)
libm.so.6 => /lib64/libm.so.6 (0x00007fe435ed7000)
libgcc_s.so.1 => /share/apps/CENTOS7/gcc/6.5.0/lib64/libgcc_s.so.1 (0x00007fe435cc0000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fe435aa4000)
libc.so.6 => /lib64/libc.so.6 (0x00007fe4356d6000)
/lib64/ld-linux-x86-64.so.2 (0x00007fe443def000)
librt.so.1 => /lib64/librt.so.1 (0x00007fe4354ce000)
libopen-rte.so.40 => /share/apps/CENTOS7/openmpi/4.0.4/lib/libopen-rte.so.40 (0x00007fe435218000)
libopen-pal.so.40 => /share/apps/CENTOS7/openmpi/4.0.4/lib/libopen-pal.so.40 (0x00007fe434f09000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007fe434d06000)
libz.so.1 => /lib64/libz.so.1 (0x00007fe434af0000)

Large-scale Atomic/Molecular Massively Parallel Simulator - 28 Mar 2023 - Development

Usage example: lmp_mpi-cuda-single-single -var t 300 -echo screen -in in.alloy

List of command line options supported by this LAMMPS executable:
<snip>

# hmmm, using -suffix gpu it does not jump on the gpus, generic non-gpu libthread error
# the same version works on rocky8/cuda-11.6 and on centos7/cuda-10.2, all "make" compiles
# try a "cmake" compile on n33-n36
# the libpace tarball download fails on its file hash and
# yields a status: [1;"Unsupported protocol"] error for ML-PACE
# without ML-PACE the hash fails for the opencl-loader third party, bad url
# https://download.lammps.org/thirdparty/opencl-loader-version...tgz
# then extract in the _deps/ dir
# and add -D GPU_LIBRARY=../lib/gpu/libgpu.a ala QUIP_LIBRARY
# that works, the cmake-compiled binary jumps on multiple gpus

[hmeij@n35 sharptail]$ mpirun -n 2 \
  /share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/cmake/single-single/lmp \
  -suffix gpu -in in.colloid

[root@greentail52 ~]# ssh n35 gpu-process
gpu_name, gpu_id, pid, process_name
Tesla K20m, 0, 9911, /share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/cmake/single-single/lmp
Tesla K20m, 1, 9912, /share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/cmake/single-single/lmp

# some stats, colloid example
1 cpu,  1 gpu   Total wall time: 0:05:49
2 cpus, 2 gpus  Total wall time: 0:03:58
4 cpus, 4 gpus  Total wall time: 0:02:23
8 cpus, 4 gpus  Total wall time: 0:02:23

# but the ML-PACE hash error is different, so no go there
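For reference, a sketch of the cmake configure described above (sm_35 and single precision for the K20m; all flags are assumptions except GPU_LIBRARY, which the notes name):

cd /usr/local/src/lammps-25Apr2023          # source tree location assumed
mkdir build-gpu && cd build-gpu
cmake ../cmake \
  -D PKG_GPU=on -D GPU_API=cuda \
  -D GPU_ARCH=sm_35 -D GPU_PREC=single \
  -D GPU_LIBRARY=../lib/gpu/libgpu.a \
  -D BUILD_MPI=yes
make -j 8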