Upgrading CUDA to the latest driver and toolkit that still support our oldest GPU model, the K20m GPUs found in nodes n33-n37 (queue mwgpu). Consult the page on the previous K20m upgrade, K20 Redo.
For legacy hardware, find the latest legacy driver here.
Then download the selected driver series here.
Then download the latest toolkit supported for the K20m on CentOS 7 (11.2).
# install drivers (uninstalls existing drivers, accept defaults)
cd /usr/local/src
BASE_URL=https://us.download.nvidia.com/tesla   # tesla driver download area, per nvidia's install docs
DRIVER_VERSION=470.199.02
curl -fSsl -O $BASE_URL/$DRIVER_VERSION/NVIDIA-Linux-x86_64-$DRIVER_VERSION.run
sh ./NVIDIA-Linux-x86_64-470.199.02.run

# install toolkit
wget https://developer.download.nvidia.com/compute/cuda/11.2.1/local_installers/cuda_11.2.1_460.32.03_linux.run
sh cuda_11.2.1_460.32.03_linux.run

# prompts
│ CUDA Installer
│ - [X] Driver
│      [X] 460.32.03
│ + [X] CUDA Toolkit 11.2
│   [ ] CUDA Samples 11.2
│   [ ] CUDA Demo Suite 11.2
│   [ ] CUDA Documentation 11.2
│   Options
│   Install

# update link to this version: yes
# no -silent -driver ...

===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-11.2/
Samples:  Not Selected

Please make sure that
 - PATH includes /usr/local/cuda-11.2/bin
 - LD_LIBRARY_PATH includes /usr/local/cuda-11.2/lib64, or,
   add /usr/local/cuda-11.2/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-11.2/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Logfile is /var/log/cuda-installer.log

# no nvidia_modprobe?
ls -l /dev/nvidia?
reboot

# then
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

[hmeij@n33 ~]$ nvidia-smi
Tue Sep  5 14:43:15 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K20m          On   | 00000000:02:00.0 Off |                    0 |
| N/A   26C    P8    25W / 225W |      0MiB /  4743MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m          On   | 00000000:03:00.0 Off |                    0 |
| N/A   27C    P8    26W / 225W |      0MiB /  4743MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20m          On   | 00000000:83:00.0 Off |                    0 |
| N/A   25C    P8    24W / 225W |      0MiB /  4743MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K20m          On   | 00000000:84:00.0 Off |                    0 |
| N/A   26C    P8    25W / 225W |      0MiB /  4743MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

# startup slurm, finds gpus? yes
# old compiled code compatible? test
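To answer those last two questions: a quick way to confirm Slurm found the GPUs after the reboot (a sketch; the slurmd log location is an assumption):

# on the node: restart slurmd, then confirm it enumerated the K20m gres
systemctl restart slurmd
grep 'Gres Name=gpu' /var/log/slurmd.log     # log path assumed
# from the head node: confirm slurm advertises the gpus
scontrol show node n33 | grep -i gres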
Script ~hmeij/slurm/run.centos runs pmemd.cuda from a local Amber20 install against CUDA 11.2 with settings along the lines sketched below.
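The script itself is not reproduced here; this is a minimal sketch of what it presumably requests (directives assumed, not copied from run.centos):

#!/bin/bash
#SBATCH --job-name=test              # appears as NAME=test in squeue
#SBATCH --partition=mwgpu            # the K20m queue, nodes n33-n37
#SBATCH -n 1                         # one task, so CPUS=1 expected
#SBATCH --gres=gpu:tesla_k20m:1      # one gpu, matches the gres type in slurmd.log

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
pmemd.cuda -O -i mdin ...            # local amber20 install; remaining args elided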
For some reason this yields CPUS=8, which is unexpected behavior (CPUS=1 was expected): Slurm overrides the script's settings with the partition setting DefCpuPerGPU=8. Slurm has not changed, but the CUDA version has. Odd. The good news is that Amber runs fine, so there is no need to recompile.
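For context, this is how such a default would look in slurm.conf (a sketch, not the actual partition definition):

# slurm.conf sketch: DefCpuPerGPU silently expands a 1-gpu job to 8 cpus
# unless the job states its cpu count explicitly
PartitionName=mwgpu Nodes=n[33-37] DefCpuPerGPU=8 State=UP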
# from slurmd.log
[2023-09-05T14:51:00.691] Gres Name=gpu Type=tesla_k20m Count=4

JOBID PARTITION  NAME   USER ST  TIME  NODES CPUS MIN_MEMORY NODELIST(REASON)
1053052  mwgpu   test  hmeij  R  0:09      1    8          0 n33

[hmeij@cottontail2 slurm]$ ssh n33 gpu-info
id,name,temp.gpu,mem.used,mem.free,util.gpu,util.mem
0, Tesla K20m, 36, 95 MiB, 4648 MiB, 100 %, 25 %
1, Tesla K20m, 26, 0 MiB, 4743 MiB, 0 %, 0 %
2, Tesla K20m, 25, 0 MiB, 4743 MiB, 0 %, 0 %
3, Tesla K20m, 26, 0 MiB, 4743 MiB, 0 %, 0 %

[hmeij@cottontail2 slurm]$ ssh n33 gpu-process
gpu_name, gpu_id, pid, process_name
Tesla K20m, 0, 28394, pmemd.cuda
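The line added to run.centos is not shown in these notes; presumably it was an explicit per-GPU CPU request, since an explicit request takes precedence over the partition's DefCpuPerGPU. A sketch, assuming --cpus-per-gpu was the directive used:

#SBATCH --cpus-per-gpu=1     # explicit per-gpu cpu count overrides DefCpuPerGPU=8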
Adding this does force Slurm to allocate just a single CPU. Now try four GPU jobs per node. No CUDA_VISIBLE_DEVICES setting is needed.
JOBID PARTITION  NAME   USER ST  TIME  NODES CPUS MIN_MEMORY NODELIST(REASON)
1053992  mwgpu   test  hmeij  R  0:04      1    1          0 n33

[hmeij@cottontail2 slurm]$ for i in `seq 1 6`; do sbatch run.centos; sleep 30; squeue | grep hmeij; done

# output
Submitted batch job 1054000
1054000  mwgpu   test  hmeij  R  0:30      1    1          0 n33
Submitted batch job 1054001
1054001  mwgpu   test  hmeij  R  0:30      1    1          0 n33
1054000  mwgpu   test  hmeij  R  1:00      1    1          0 n33
Submitted batch job 1054002
1054002  mwgpu   test  hmeij  R  0:30      1    1          0 n33
1054001  mwgpu   test  hmeij  R  1:00      1    1          0 n33
1054000  mwgpu   test  hmeij  R  1:30      1    1          0 n33
Submitted batch job 1054003
1054003  mwgpu   test  hmeij  R  0:30      1    1          0 n33
1054002  mwgpu   test  hmeij  R  1:00      1    1          0 n33
1054001  mwgpu   test  hmeij  R  1:30      1    1          0 n33
1054000  mwgpu   test  hmeij  R  2:00      1    1          0 n33
Submitted batch job 1054004
1054004  mwgpu   test  hmeij PD  0:00      1    1          0 (Resources)
1054003  mwgpu   test  hmeij  R  1:00      1    1          0 n33
1054002  mwgpu   test  hmeij  R  1:30      1    1          0 n33
1054001  mwgpu   test  hmeij  R  2:00      1    1          0 n33
1054000  mwgpu   test  hmeij  R  2:30      1    1          0 n33
Submitted batch job 1054005
1054005  mwgpu   test  hmeij PD  0:00      1    1          0 (Nodes required f
1054004  mwgpu   test  hmeij PD  0:00      1    1          0 (Resources)
1054003  mwgpu   test  hmeij  R  1:30      1    1          0 n33
1054002  mwgpu   test  hmeij  R  2:00      1    1          0 n33
1054001  mwgpu   test  hmeij  R  2:30      1    1          0 n33
1054000  mwgpu   test  hmeij  R  3:00      1    1          0 n33

[hmeij@cottontail2 slurm]$ ssh n33 gpu-info
id,name,temp.gpu,mem.used,mem.free,util.gpu,util.mem
0, Tesla K20m, 40, 95 MiB, 4648 MiB, 100 %, 25 %
1, Tesla K20m, 40, 95 MiB, 4648 MiB, 94 %, 23 %
2, Tesla K20m, 35, 95 MiB, 4648 MiB, 93 %, 21 %
3, Tesla K20m, 28, 95 MiB, 4648 MiB, 97 %, 25 %
Other software does need to be recompiled because it links against specific versions of libraries rather than the generic libName.so (LAMMPS, for example).
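A quick way to spot such versioned dependencies (the binary path is taken from the listing below):

# versioned sonames like libcudart.so.11.0 force a rebuild when cuda changes;
# "not found" lines flag libraries that no longer resolve
ldd /share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/lmp_mpi-cuda-single-single \
  | grep -E 'cuda|cufft|not found'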
Script ~hmeij/slurm/run.centos.lammps sets up the environment and requests the help page.
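A sketch of that environment setup, assuming plain exports rather than modules (paths taken from the toolchain listing below):

export PATH=/usr/local/cuda/bin:/share/apps/CENTOS7/openmpi/4.0.4/bin:/share/apps/CENTOS7/python/3.8.3/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/share/apps/CENTOS7/openmpi/4.0.4/lib:/share/apps/CENTOS7/gcc/6.5.0/lib64:$LD_LIBRARY_PATH
# sanity check the toolchain, then ask the lammps binary for its help page
which nvcc mpirun python
/share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/lmp_mpi-cuda-single-single -h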
/share/apps/intel/parallel_studio_xe_2016_update3/compilers_and_libraries_2016.3.210/linux/bin/intel64/ifort
/usr/local/cuda/bin/nvcc
/share/apps/CENTOS7/openmpi/4.0.4/bin/mpirun
/share/apps/CENTOS7/python/3.8.3/bin/python

/share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/lmp_mpi-cuda-single-single

linux-vdso.so.1 => (0x00007ffd714ec000)
libjpeg.so.62 => /lib64/libjpeg.so.62 (0x00007fe443b9a000)
libcudart.so.11.0 => /usr/local/cuda/lib64/libcudart.so.11.0 (0x00007fe44390b000)
libcuda.so.1 => /lib64/libcuda.so.1 (0x00007fe442223000)
libcufft.so.10 => /usr/local/cuda/lib64/libcufft.so.10 (0x00007fe436a74000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007fe436870000)
libmpi.so.40 => /share/apps/CENTOS7/openmpi/4.0.4/lib/libmpi.so.40 (0x00007fe43655b000)
libstdc++.so.6 => /share/apps/CENTOS7/gcc/6.5.0/lib64/libstdc++.so.6 (0x00007fe4361d9000)
libm.so.6 => /lib64/libm.so.6 (0x00007fe435ed7000)
libgcc_s.so.1 => /share/apps/CENTOS7/gcc/6.5.0/lib64/libgcc_s.so.1 (0x00007fe435cc0000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fe435aa4000)
libc.so.6 => /lib64/libc.so.6 (0x00007fe4356d6000)
/lib64/ld-linux-x86-64.so.2 (0x00007fe443def000)
librt.so.1 => /lib64/librt.so.1 (0x00007fe4354ce000)
libopen-rte.so.40 => /share/apps/CENTOS7/openmpi/4.0.4/lib/libopen-rte.so.40 (0x00007fe435218000)
libopen-pal.so.40 => /share/apps/CENTOS7/openmpi/4.0.4/lib/libopen-pal.so.40 (0x00007fe434f09000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007fe434d06000)
libz.so.1 => /lib64/libz.so.1 (0x00007fe434af0000)

Large-scale Atomic/Molecular Massively Parallel Simulator - 28 Mar 2023 - Development

Usage example: lmp_mpi-cuda-single-single -var t 300 -echo screen -in in.alloy

List of command line options supported by this LAMMPS executable:
<snip>

# hmmm, using -suffix gpu it does not jump on the gpus, generic non-gpu libthread error
# the same version works on rocky8/cuda-11.6 and on centos7/cuda-10.2, all "make" compiles
# try a "cmake" compile on n33-n36
# the libpace tarball download fails on its file hash and
# yields a status: [1;"Unsupported protocol"] error for ML-PACE
# without ML-PACE the hash fails for the opencl-loader third party, bad url
# https://download.lammps.org/thirdparty/opencl-loader-version...tgz
# then extract in the _deps/ dir
# and add -D GPU_LIBRARY=../lib/gpu/libgpu.a ala QUIP_LIBRARY
# that works, the cmake-compiled binary jumps on multiple gpus

[hmeij@n35 sharptail]$ mpirun -n 2 \
  /share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/cmake/single-single/lmp \
  -suffix gpu -in in.colloid

[root@greentail52 ~]# ssh n35 gpu-process
gpu_name, gpu_id, pid, process_name
Tesla K20m, 0, 9911, /share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/cmake/single-single/lmp
Tesla K20m, 1, 9912, /share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/cmake/single-single/lmp

# some stats, colloid example
1 cpu,  1 gpu   Total wall time: 0:05:49
2 cpus, 2 gpus  Total wall time: 0:03:58
4 cpus, 4 gpus  Total wall time: 0:02:23
8 cpus, 4 gpus  Total wall time: 0:02:23

# but the ML-PACE hash error is different, so no go there
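For reference, a sketch of the cmake configure described above (sm_35 and single precision for the K20m; all flags are assumptions except GPU_LIBRARY, which the notes name):

cd /usr/local/src/lammps-25Apr2023          # source tree location assumed
mkdir build-gpu && cd build-gpu
cmake ../cmake \
  -D PKG_GPU=on -D GPU_API=cuda \
  -D GPU_ARCH=sm_35 -D GPU_PREC=single \
  -D GPU_LIBRARY=../lib/gpu/libgpu.a \
  -D BUILD_MPI=yes
make -j 8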