


Back

cuda toolkit

Upgrading to the latest driver and toolkit that support our oldest GPU model, the K20m GPUs found in nodes n33-n37 (queue mwgpu). Consult the page on the previous K20m upgrade, K20 Redo

For legacy hardware, find the latest legacy driver here

Then download the selected driver series here

Then download the latest toolkit that supports the K20m on CentOS 7 (11.2)

# install drivers (uninstalls existing drivers, accept defaults)

cd /usr/local/src
# BASE_URL was not set in the original notes; NVIDIA's driver archive
# is assumed to be the usual location below
BASE_URL=https://us.download.nvidia.com/XFree86/Linux-x86_64
DRIVER_VERSION=470.199.02
curl -fsSL -O $BASE_URL/$DRIVER_VERSION/NVIDIA-Linux-x86_64-$DRIVER_VERSION.run

sh ./NVIDIA-Linux-x86_64-$DRIVER_VERSION.run

# install toolkit

wget https://developer.download.nvidia.com/compute/cuda/11.2.1/local_installers/cuda_11.2.1_460.32.03_linux.run

sh cuda_11.2.1_460.32.03_linux.run

# prompts

│ CUDA Installer                                                               │
│ - [X] Driver                                                                 │
│      [X] 460.32.03                                                           │
│ + [X] CUDA Toolkit 11.2                                                      │
│   [ ] CUDA Samples 11.2                                                      │
│   [ ] CUDA Demo Suite 11.2                                                   │
│   [ ] CUDA Documentation 11.2                                                │
│   Options                                                                    │
│   Install                                                                    │

# update the /usr/local/cuda symlink to this version: yes
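
A quick check that the symlink now points at the new toolkit (the output line is assumed, it was not captured in these notes):

ls -l /usr/local/cuda
# lrwxrwxrwx ... /usr/local/cuda -> /usr/local/cuda-11.2/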

===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-11.2/
Samples:  Not Selected

Please make sure that
 -   PATH includes /usr/local/cuda-11.2/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-11.2/lib64, or, add /usr/local/cuda-11.2/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-11.2/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Logfile is /var/log/cuda-installer.log
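
For repeating this on the remaining nodes, the same runfile can be driven non-interactively; a sketch, verify the flags with --help first:

# unattended install of driver plus toolkit, accepting the license
sh cuda_11.2.1_460.32.03_linux.run --silent --driver --toolkit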


# no nvidia-modprobe run? nope, and don't know where the devices are; they are not in /dev/nvidia*

reboot

# then

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
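
To make these settings persistent for all users across reboots, a sketch (the file names under /etc are my choice, not from these notes):

# system-wide environment for every login shell
cat > /etc/profile.d/cuda.sh <<'EOF'
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
EOF

# or register the libraries with the dynamic linker,
# as the installer summary above suggests
echo /usr/local/cuda-11.2/lib64 > /etc/ld.so.conf.d/cuda.conf
ldconfig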

[hmeij@n33 ~]$ nvidia-smi
Tue Sep  5 14:43:15 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K20m          On   | 00000000:02:00.0 Off |                    0 |
| N/A   26C    P8    25W / 225W |      0MiB /  4743MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m          On   | 00000000:03:00.0 Off |                    0 |
| N/A   27C    P8    26W / 225W |      0MiB /  4743MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20m          On   | 00000000:83:00.0 Off |                    0 |
| N/A   25C    P8    24W / 225W |      0MiB /  4743MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K20m          On   | 00000000:84:00.0 Off |                    0 |
| N/A   26C    P8    25W / 225W |      0MiB /  4743MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

# startup slurm, finds gpus? yes
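
One way to confirm slurmd registered the gres, using standard Slurm commands:

scontrol show node n33 | grep -i gres
# expect something like: Gres=gpu:tesla_k20m:4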

# old compiled code compatible? test

Test

Script ~hmeij/slurm/run.centos, cuda 11.2, pmemd.cuda from a local install of amber20, with the settings below (a minimal sketch of such a script follows the list)

  • #SBATCH -N 1
  • #SBATCH -n 1
  • #SBATCH -B 1:1:1
  • #SBATCH --mem-per-gpu=7168
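
A minimal sketch of what such a submit script might look like; the actual run.centos is not reproduced here, and the amber path and input file names are placeholders:

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -B 1:1:1
#SBATCH --mem-per-gpu=7168
#SBATCH -p mwgpu
#SBATCH --gres=gpu:tesla_k20m:1

# point at the new toolkit
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# amber20 environment (install location is an assumption)
source /usr/local/amber20/amber.sh

# hypothetical input files
pmemd.cuda -O -i mdin -o mdout -p prmtop -c inpcrd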

For some reason this yields cpus=8, which is different behavior (cpus=1 expected). Slurm is overriding the settings above with the partition setting DefCpuPerGPU=8. Slurm has not changed but the cuda version has. Odd. The other oddity is that Slurm finds the nvidia devices, but where are they? The gres.conf settings work. Odd. Yet pmemd.cuda is running on gpu ID 0.
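
If cpus=1 is really wanted, the partition default can be overridden per job; a sketch with standard Slurm options:

# inspect the partition default
scontrol show partition mwgpu | grep -i defcpupergpu
# expect something like: JobDefaults=DefCpuPerGPU=8

# then request one cpu per gpu explicitly in the job script
#SBATCH --cpus-per-gpu=1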

# from slurmd.log
[2023-09-05T14:51:00.691] Gres Name=gpu Type=tesla_k20m Count=4

JOBID   PARTITION         NAME          USER  ST          TIME NODES  CPUS    MIN_MEMORY NODELIST(REASON)
1053052 mwgpu             test         hmeij   R          0:09     1     8             0              n33

[hmeij@cottontail2 slurm]$ grep k20 /usr/local/slurm/etc/gres.conf
NodeName=n[33-36]  Name=gpu Type=tesla_k20m          File=/dev/nvidia[0-3]
[hmeij@cottontail2 slurm]$ ls -l /dev/nvidia?
ls: cannot access '/dev/nvidia?': No such file or directory

[hmeij@cottontail2 slurm]$ ssh n33 gpu-info
id,name,temp.gpu,mem.used,mem.free,util.gpu,util.mem
0, Tesla K20m, 36, 95 MiB, 4648 MiB, 100 %, 25 %
1, Tesla K20m, 26, 0 MiB, 4743 MiB, 0 %, 0 %
2, Tesla K20m, 25, 0 MiB, 4743 MiB, 0 %, 0 %
3, Tesla K20m, 26, 0 MiB, 4743 MiB, 0 %, 0 %

[hmeij@cottontail2 slurm]$ ssh n33 gpu-process
gpu_name, gpu_id, pid, process_name
Tesla K20m, 0, 28394, pmemd.cuda

# and hours later, the same command shows the device files

[root@n33 ~]# ls -l /dev/nvidia?
crw-rw-rw- 1 root root 195, 0 Sep  5 14:39 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 Sep  5 14:39 /dev/nvidia1
crw-rw-rw- 1 root root 195, 2 Sep  5 14:39 /dev/nvidia2
crw-rw-rw- 1 root root 195, 3 Sep  5 14:39 /dev/nvidia3
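
This is consistent with the device nodes being created on first use: the first CUDA client (here the pmemd.cuda job, or an nvidia-smi run) invokes nvidia-modprobe, which creates /dev/nvidia*. To have them present right after reboot, one common approach, sketched here (the rc.local path is an assumption):

# run once at boot as root to create the device nodes,
# e.g. appended to /etc/rc.d/rc.local (chmod +x that file on centos7)
/usr/bin/nvidia-smi > /dev/null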

# okidoki then


Back
