Upgrading to the latest drivers and toolkit that still support our oldest GPU model, the K20m GPUs found in nodes n33-n37 (queue mwgpu). Consult the page on the previous K20m upgrade: K20 Redo
For legacy hardware, find the latest legacy driver here
Then download the selected driver series here
Then download the latest supported toolkit (11.2) for the K20m on CentOS 7
# install drivers (uninstalls existing drivers, accept defaults)
cd /usr/local/src
# BASE_URL is the NVIDIA driver download location, set beforehand
DRIVER_VERSION=470.199.02
curl -fsSL -O $BASE_URL/$DRIVER_VERSION/NVIDIA-Linux-x86_64-$DRIVER_VERSION.run
sh ./NVIDIA-Linux-x86_64-470.199.02.run

# install toolkit
wget https://developer.download.nvidia.com/compute/cuda/11.2.1/local_installers/cuda_11.2.1_460.32.03_linux.run
sh cuda_11.2.1_460.32.03_linux.run

# prompts
│ CUDA Installer
│ - [X] Driver
│      [X] 460.32.03
│ + [X] CUDA Toolkit 11.2
│   [ ] CUDA Samples 11.2
│   [ ] CUDA Demo Suite 11.2
│   [ ] CUDA Documentation 11.2
│   Options
│   Install

# update link to this version: yes

===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-11.2/
Samples:  Not Selected

Please make sure that
 -   PATH includes /usr/local/cuda-11.2/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-11.2/lib64, or,
     add /usr/local/cuda-11.2/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-11.2/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Logfile is /var/log/cuda-installer.log

# no nvidia_modprobe? nope, don't know where
# drivers not showing up in /dev/nvidia*? reboot
# some time later nvidia drivers show up...

# then
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

[hmeij@n33 ~]$ nvidia-smi
Tue Sep  5 14:43:15 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K20m          On   | 00000000:02:00.0 Off |                    0 |
| N/A   26C    P8    25W / 225W |      0MiB /  4743MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m          On   | 00000000:03:00.0 Off |                    0 |
| N/A   27C    P8    26W / 225W |      0MiB /  4743MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20m          On   | 00000000:83:00.0 Off |                    0 |
| N/A   25C    P8    24W / 225W |      0MiB /  4743MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K20m          On   | 00000000:84:00.0 Off |                    0 |
| N/A   26C    P8    25W / 225W |      0MiB /  4743MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

# startup slurm, finds gpus? yes
# old compiled code compatible? test
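A quick sanity check at this point is to confirm the installed driver is new enough for the CUDA 11.2 runtime. A minimal sketch, assuming NVIDIA's documented minimum Linux driver for CUDA 11.2 of 460.27.04; the default argument is the 470.199.02 driver installed above (in practice you would feed it the live value from nvidia-smi, as the comment shows):

```shell
# hedged helper: does a given driver version meet the CUDA 11.2 minimum?
# minimum per NVIDIA's CUDA 11.2 release notes (assumption, verify locally)
min_driver="460.27.04"
# e.g. driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1)
driver="${1:-470.199.02}"
# version-aware comparison: the larger of the two sorts last with sort -V
newest=$(printf '%s\n%s\n' "$min_driver" "$driver" | sort -V | tail -1)
if [ "$newest" = "$driver" ]; then
  echo "driver $driver OK for CUDA 11.2"
else
  echo "driver $driver too old for CUDA 11.2"
fi
```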
Test: script ~hmeij/slurm/run.centos, CUDA 11.2, pmemd.cuda from the local Amber20 install, submitted with the job's usual #SBATCH settings (which request a single CPU).
For some reason this yields cpus=8, which differs from the expected behavior (cpus=1). Slurm is overriding the job's settings with the partition default DefCpuPerGPU=8. The Slurm configuration has not changed, but the CUDA version has. Odd.
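Per the slurm.conf and sbatch documentation, DefCpuPerGPU is only a default, so an explicit job-level --cpus-per-gpu should take precedence over it. A sketch of a workaround job script under that assumption (the gres type tesla_k20m matches the node's Gres line in slurmd.log; the script body is illustrative):

```shell
#!/bin/bash
# hypothetical minimal job: pin the CPU count back to 1 on the mwgpu
# partition by overriding the partition's DefCpuPerGPU=8 default
#SBATCH --partition=mwgpu
#SBATCH --nodelist=n33
#SBATCH --gres=gpu:tesla_k20m:1
#SBATCH --cpus-per-gpu=1

# job payload goes here, e.g. the amber20 pmemd.cuda run from run.centos
```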
# from slurmd.log
[2023-09-05T14:51:00.691] Gres Name=gpu Type=tesla_k20m Count=4

  JOBID PARTITION     NAME     USER ST   TIME  NODES CPUS MIN_MEMORY NODELIST(REASON)
1053052     mwgpu     test    hmeij  R   0:09      1    8          0 n33

[hmeij@cottontail2 slurm]$ ssh n33 gpu-info
id,name,temp.gpu,mem.used,mem.free,util.gpu,util.mem
0, Tesla K20m, 36, 95 MiB, 4648 MiB, 100 %, 25 %
1, Tesla K20m, 26, 0 MiB, 4743 MiB, 0 %, 0 %
2, Tesla K20m, 25, 0 MiB, 4743 MiB, 0 %, 0 %
3, Tesla K20m, 26, 0 MiB, 4743 MiB, 0 %, 0 %

[hmeij@cottontail2 slurm]$ ssh n33 gpu-process
gpu_name, gpu_id, pid, process_name
Tesla K20m, 0, 28394, pmemd.cuda
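When checking where a job landed, the util.gpu column of gpu-info is what tells you which cards are busy. A small awk filter can pull that out; a sketch using the sample n33 output above as canned input (in practice you would pipe `ssh n33 gpu-info` into the filter instead):

```shell
# canned gpu-info data lines from the n33 run above
gpu_info='0, Tesla K20m, 36, 95 MiB, 4648 MiB, 100 %, 25 %
1, Tesla K20m, 26, 0 MiB, 4743 MiB, 0 %, 0 %
2, Tesla K20m, 25, 0 MiB, 4743 MiB, 0 %, 0 %
3, Tesla K20m, 26, 0 MiB, 4743 MiB, 0 %, 0 %'
# field 6 is util.gpu ("100 %"); $6+0 coerces it to a number
busy=$(printf '%s\n' "$gpu_info" | awk -F', ' '$6+0 > 0 {print $1}')
echo "busy gpus: $busy"
```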