
Cuda

Upgrading Cuda to the latest drivers and toolkit that support our GeForce RTX 2080 SUPER (and Ti) gpu models (queues exx96 and amber128). Before we embark on upgrading all nodes, we need to test backward compatibility and assess how troublesome the upgrade might be.

First look at what is end-of-life these days: in the middle of the page go to the table “GPUs supported” and scroll down, keeping an eye on the Turing and Lovelace columns. Our RTX 2080 SUPER cards (Turing, currently on cuda 10.2) are still supported in version 12.4, so we can install the latest toolkit and drivers.

https://en.wikipedia.org/wiki/CUDA

And it looks like Tesla P4, P100 and V100 are next on the end-of-extended-life list; consult table “A. End-of-Software-Support Dates for GPUs Supported by NVIDIA vGPU Software”.

https://docs.nvidia.com/grid/news/vgpu-software-lifecycle-on-supported-gpus/index.html

Find the drivers at this page (use the section for non-legacy gpu models, not the legacy one), look for the link Linux x86_64/AMD64/EM64T Latest Production Branch Version: 550.67

https://www.nvidia.com/en-us/drivers/unix/

Linux x86_64/AMD64/EM64T
Latest Production Branch Version: 550.67
Supported Products Tab
(make sure the models are listed – they are)

https://www.nvidia.com/Download/driverResults.aspx/223426/en-us/

Agree&Download the NVIDIA-Linux-x86_64-550.67.run driver file.

Now obtain the latest toolkit as our gpu models are supported as verified above

https://developer.nvidia.com/cuda-downloads
Select: Linux > x86_64 > CentOS > 7 > Runfile (local)
I see they have Rocky support too…

and you will get instructions

# centos 7
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
sudo sh cuda_12.4.0_550.54.14_linux.run

# rocky 8 (same file)
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
sudo sh cuda_12.4.0_550.54.14_linux.run

The complete Linux Installer Guide is at

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

Installation

On CentOS 7 hop on vlan52, yum update followed by reboot (EOL 06/30/2024)
- skip; libnvidia-container repo has problems and may be crucial
On Rocky 8
- yum update; skip this step for now …
Make both files executable (same files for both)
Install driver first (NVIDIA-Linux-*.run, reboot)
Then install toolkit (cuda_12.4*.run, reboot)

# n78 first ... can reimage cuda-11.6 from n101 (no problem, tests success)
# make sure /usr/src/kernels/$(uname -r) exists else 
# scp into place from n100 (centos8, possibly caused by warewulf...)
# however old nvidia packages still in OS (driver 510 toolkit 11.6)...
#  rpm -qa | grep ^nvidia | wc -l results in 16 packages...
# what happens on dnf check-update ???
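
The two checks in the comments above (kernel source directory present, leftover nvidia rpm packages) could be wrapped in small helpers before touching a node. This is only a sketch: the function names `kernel_src_present` and `count_nvidia_pkgs` are made up for illustration and are not part of any existing script here.

```shell
# Pre-flight sketch. Function names are hypothetical, not from the original notes.

# Succeeds when the kernel source directory the notes check for is present.
# $1 = kernel release (e.g. "$(uname -r)"), $2 = base dir (default /usr/src/kernels)
kernel_src_present() {
    local release="$1" base="${2:-/usr/src/kernels}"
    [ -d "$base/$release" ]
}

# Counts package names on stdin that start with "nvidia".
# Feed it the output of: rpm -qa
count_nvidia_pkgs() {
    # grep -c exits non-zero when the count is 0, so keep the status clean
    grep -c '^nvidia' || true
}

# Example use on a node (commented out so the block has no side effects):
# kernel_src_present "$(uname -r)" || echo "missing /usr/src/kernels/$(uname -r)"
# echo "nvidia rpm packages: $(rpm -qa | count_nvidia_pkgs)"
```

A non-zero package count on a node that is about to get the runfile install is the situation flagged above for n78.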

# n[100-101] skipping for now
# this is a package install and there is no nvidia-uninstall (runfile)
# an upgrade would require internet ('dnf check-update; dnf update')
# switching between rpm install and runfile is NOT recommended
# and 'dnf erase nvidia*' may leave a hung system behind

# n79 next (no problem)
# but during testing found this error which I had seen BEFORE cuda install
reason:         BUG: unable to handle kernel NULL pointer dereference at 0000000000000108
# follow up, nvidia driver "taints" the kernel by loading proprietary driver
# this is mostly a warning but may interfere, as docker likely does
https://unix.stackexchange.com/questions/118116/what-is-a-tainted-linux-kernel
# disabled docker on n79 for now 04/15/2024 9:06AM
# also rotated the memory dimms some time later, seems to have fixed issue
# started docker back up on n79 05/06/2024 9:56AM (has been up 17 days by now)

# n89 next (no problem)
# but upon reboot I encountered that error for the FIRST time on this node
# need to research, it seems somehow related to cuda install
# n80 (same error upon reboot after driver install)
# n81 (same error upon reboot after driver install)
# n90 (same error upon reboot after toolkit install, not driver. weird)
# n88 (failed toolkit install, ran /usr/bin/nvidia-uninstall, reboot,
#      re-installed driver, reboot, re-installed toolkit, reboot,
#      no error occurs! )
# n87 (ran nvidia-uninstall first, driver then toolkit, error shows up)
# n86 & n85 same as n87
# n84 (no error shows up)
# n82 (as n87 but error shows up after driver before toolkit install)
# n83 (error shows up after toolkit install, not driver install reboot)
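
Since the error shows up at different points on different nodes, it may help to check for it the same way after every reboot. A minimal sketch, assuming the message text stays the same as the one logged on n79; the function name `has_null_deref` is hypothetical.

```shell
# Prints matching lines and succeeds if the kernel log text on stdin contains
# the NULL pointer dereference BUG seen on several nodes.
has_null_deref() {
    grep -i 'BUG: unable to handle kernel NULL pointer dereference'
}

# Example use after a reboot (commented out so the block has no side effects):
# dmesg | has_null_deref && echo "error present on $(hostname)"
```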

sh ./NVIDIA-Linux-x86_64-550.67.run

32 bit compat? no
dkms build? yes
rebuild initramfs? yes
xconfig? no
error nvidia module can not be loaded
reboot fixed that

[root@n78 ~]# nvidia-smi
Mon Apr  1 14:50:33 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:02:00.0 Off |                  N/A |
| 19%   38C    P0             59W /  250W |       0MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:03:00.0 Off |                  N/A |
| 16%   25C    P0             58W /  250W |       0MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:81:00.0 Off |                  N/A |
| 18%   28C    P0             59W /  250W |       0MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:82:00.0 Off |                  N/A |
| 19%   29C    P0             58W /  250W |       0MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
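
Rather than eyeballing the table on every node, the driver version can be checked with the query mode of nvidia-smi. The sketch below assumes the CSV form `name, driver_version` that `--query-gpu=name,driver_version --format=csv,noheader` prints; the function name `all_gpus_on_driver` is made up for illustration.

```shell
# Reads "name, driver_version" CSV lines on stdin and succeeds only if there is
# at least one GPU and every GPU reports the expected driver version ($1).
all_gpus_on_driver() {
    local want="$1"
    awk -F', *' -v want="$want" '
        { n++; if ($2 != want) bad++ }
        END { if (n > 0 && bad == 0) exit 0; exit 1 }
    '
}

# Example use on a node (commented out so the block has no side effects):
# nvidia-smi --query-gpu=name,driver_version --format=csv,noheader \
#     | all_gpus_on_driver 550.67 && echo "driver ok"
```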


rm -f /usr/local/cuda # soft link to current version
/usr/bin/nvidia-uninstall  # probably not needed, made no difference > Continue

sh cuda_12.4.0_550.54.14_linux.run

REBOOT and check date before launching slurm
mv /var/spool/slurmd/cred_state /var/spool/slurmd/cred_state.bak

===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-12.4/

Please make sure that
 -   PATH includes /usr/local/cuda-12.4/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-12.4/lib64, or, add /usr/local/cuda-12.4/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-12.4/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Logfile is /var/log/cuda-installer.log

# check /usr/local/cuda link, clean up others

[root@n78 ~]# export PATH=/usr/local/cuda/bin:$PATH
[root@n78 ~]# export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
[root@n78 ~]# which nvidia-smi
/usr/bin/nvidia-smi

[root@n78 ~]# which nvcc
/usr/local/cuda/bin/nvcc

[root@n78 ~]# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:19:38_PST_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0
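
For a login profile or module file, the two exports can be made safe to run more than once and pinned to the versioned directory from the installer summary instead of the soft link. A sketch only: `prepend_once` and `CUDA_DIR` are hypothetical names, and the variable name passed in must be a trusted identifier because the helper uses eval to stay portable across shells.

```shell
# Prepends directory $2 to the colon-separated variable named $1, unless it is
# already listed, then exports the variable.
prepend_once() {
    local var="$1" dir="$2" cur
    eval "cur=\${$var:-}"              # indirect read of the named variable
    case ":$cur:" in
        *":$dir:"*) ;;                 # already present, nothing to do
        *) eval "$var=\"\$dir\${cur:+:\$cur}\""; export "$var" ;;
    esac
}

# Versioned toolkit path from the installer summary, not the /usr/local/cuda link.
CUDA_DIR=/usr/local/cuda-12.4          # hypothetical variable name
prepend_once PATH "$CUDA_DIR/bin"
prepend_once LD_LIBRARY_PATH "$CUDA_DIR/lib64"
```

Sourcing this twice leaves PATH and LD_LIBRARY_PATH unchanged the second time.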

# slurm finds gpus? yes...

[2024-01-18T12:46:29.322] debug:  gpu/generic: init: init: GPU Generic plugin loaded
[2024-01-18T12:46:29.322] Gres Name=gpu Type=geforce_gtx_1080_ti Count=4

Testing

Rocky 8 on n78

Previously compiled software using OpenHPC modules pulls in the cuda/11.6 module. So let's test if that works with the new 550 driver and 12.4 toolkit. You can follow these steps to test any other software.

# ssh to compute node and load module
[hmeij@n78 ~]$ module load amber/22

# identify object of interest
[hmeij@n78 ~]$ which pmemd.cuda
/share/apps/CENTOS8/ohpc/software/amber/22/bin/pmemd.cuda

# feed to ldd, look for missing libraries

[hmeij@n78 ~]$ ldd `which pmemd.cuda`
        linux-vdso.so.1 (0x00007ffeb51ea000)
        libemil.so => /share/apps/CENTOS8/ohpc/software/amber/22//lib/libemil.so (0x00007f99ae5aa000)
        libnetcdff.so.6 => /share/apps/CENTOS8/ohpc/software/amber/22//lib/libnetcdff.so.6 (0x00007f99ae315000)
        libkmmd.so => /share/apps/CENTOS8/ohpc/software/amber/22//lib/libkmmd.so (0x00007f99ae10a000)
        libcufft.so.10 => /usr/local/cuda/lib64/libcufft.so.10 (0x00007f99a567d000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f99a5479000)
        libcurand.so.10 => /usr/local/cuda/lib64/libcurand.so.10 (0x00007f999f033000)
        libcublas.so.11 => /usr/local/cuda/lib64/libcublas.so.11 (0x00007f99955bf000)
        libnetcdf.so.15 => /lib64/libnetcdf.so.15 (0x00007f9995273000)
        libarpack.so => /share/apps/CENTOS8/ohpc/software/amber/22//lib/libarpack.so (0x00007f9995056000)
        libopenblas.so.0 => /lib64/libopenblas.so.0 (0x00007f999328e000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f9992f0c000)
<snip>

# does it work?

[hmeij@n78 ~]$ pmemd.cuda --version
pmemd.cuda: Version 22.0
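
The ldd step can be scripted so that missing libraries, and libraries resolved through the generic /usr/local/cuda link, are flagged automatically. This is a sketch that assumes the usual `name => path (address)` layout of ldd output; the function name `flag_ldd_problems` is hypothetical.

```shell
# Reads ldd output on stdin. Prints libraries that are "not found" and libraries
# resolved through the generic /usr/local/cuda/ link rather than a versioned
# path. Succeeds only when nothing is flagged.
flag_ldd_problems() {
    awk '
        BEGIN { bad = 0 }
        /not found/               { print "MISSING: " $1; bad = 1 }
        /=> \/usr\/local\/cuda\// { print "GENERIC: " $1 " -> " $3; bad = 1 }
        END { exit bad }
    '
}

# Example use (commented out so the block has no side effects):
# ldd "$(which pmemd.cuda)" | flag_ldd_problems
```

Against the ldd listing above this would flag libcufft.so.10, libcurand.so.10 and libcublas.so.11 as GENERIC.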

So you'll notice libraries are pulled in from …/amber/22/… and …/cuda/lib64/…. The latter is technically wrong because that generic path now points to 12.4, not 11.6. The problem was that the module did not specify the absolute path to cuda-11.6. I fixed that and provided soft links for individual libraries in cuda-12.4. Not the way to do it, but it works. So software built with toolkit 11.6 runs alongside 12.4 using the newly installed 550 driver. Amber 20/22 and Lammps 25Apr2023/7Feb2024 all work. See script ~hmeij/slurm/run.rocky.2.

pmemd.cuda: error while loading shared libraries: libcufft.so.10
...argh, Amber links in specific version of libraries

# not the way to do things...amber 20&22
  cd /usr/local/cuda-12.4/targets/x86_64-linux/lib/
  ln -s /usr/local/cuda-11.6/lib64/libcufft.so.10 
  ln -s /usr/local/cuda-11.6/lib64/libcublas.so.11
  ln -s /usr/local/cuda-11.6/lib64/libcublasLt.so.11

# not the way to do things...lammps 7Feb2024 and 25Apr2023
  ln -s /usr/local/cuda-11.6/lib64/libcudart.so.11.0
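
If the same links have to be made on more nodes, a small helper keeps it repeatable and avoids clobbering anything that is already in place. A minimal sketch; the function name `link_compat_libs` is hypothetical and it does nothing more than the ln -s commands above, with two safety checks added.

```shell
# Links each named library from an old toolkit's lib dir ($1) into a new one ($2).
# Skips names that are missing in the source or already present in the target.
link_compat_libs() {
    local src="$1" dest="$2" lib
    shift 2
    for lib in "$@"; do
        if [ ! -e "$src/$lib" ]; then
            echo "skip $lib: not in $src" >&2
        elif [ -e "$dest/$lib" ] || [ -L "$dest/$lib" ]; then
            echo "skip $lib: already in $dest" >&2
        else
            ln -s "$src/$lib" "$dest/$lib"
        fi
    done
}

# Example matching the commands above (run as root, commented out here):
# link_compat_libs /usr/local/cuda-11.6/lib64 \
#     /usr/local/cuda-12.4/targets/x86_64-linux/lib \
#     libcufft.so.10 libcublas.so.11 libcublasLt.so.11 libcudart.so.11.0
```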

CentOS 7 on n89

The steps above can also be done for the default cuda installation on exx96, where the soft link /usr/local/cuda would have pointed to /usr/local/cuda-10.2. Do not follow the soft link; use the path with the toolkit version in it when setting your cuda environment.

Next test is to see if older software is compatible with newer drivers. We test that by running a gpu program built with cuda toolkit 9.2 against the new 550 driver and seeing if it works (~hmeij/slurm/run.centos7.2).

# same steps as above using n89
# set up the environment the script uses

[hmeij@n89 ~]$ which nvcc
/usr/local/n37-cuda-9.2/bin/nvcc

[hmeij@n89 ~]$ source /usr/local/amber20/amber.sh 

[hmeij@n89 ~]$ which pmemd.cuda
/usr/local/amber20/bin/pmemd.cuda

[hmeij@n89 ~]$ pmemd.cuda --version
pmemd.cuda: Version 20.0

# going further back in time

[hmeij@n89 ~]$ source /usr/local/amber16/amber.sh 

[hmeij@n89 ~]$ which pmemd.cuda
/usr/local/amber16/bin/pmemd.cuda

[hmeij@n89 ~]$ pmemd.cuda --version
pmemd.cuda: Version 14.0

So it appears that all our cuda versions can use the new 550 driver that comes with the cuda-12.4 toolkit. Two other cuda versions (cuda-11.2 in mwgpu and cuda-10.2 in exx96) have not been tested but should function as well. In these queues cuda-9.2 is present on local hard disk and software was compiled against that toolkit, so which queue to use did not matter (compilations against 10.2 did not run in 9.2, as expected). I was, however, able to test and run lammps in 10.2; consult the file ~hmeij/slurm/centos.2.

These compatibility results are way, way better than expected. Yea.

Any new software compilations will use module cuda/12.4.
