
cuda toolkit

Upgrading CUDA to the latest driver and toolkit that still support our oldest GPU model, the K20m GPUs found in nodes n33-n37 (queue mwgpu). Consult the page on the previous K20m upgrade, K20 Redo.

For legacy hardware, find the latest legacy driver here.

Then download the driver series selected here.

Then download the latest toolkit that supports the K20m on CentOS 7 (11.2).
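Before choosing a driver branch it helps to confirm what the target nodes actually report; a minimal check (nvidia-smi only works if an older driver is still loaded):

# confirm the GPU model on a target node; lspci needs no driver
lspci | grep -i nvidia

# if an older driver is still installed, also note the current driver version
nvidia-smi --query-gpu=name,driver_version --format=csv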

# install drivers (uninstalls existing drivers, accept defaults)

cd /usr/local/src
# BASE_URL is assumed to point at the NVIDIA driver download area, e.g. https://us.download.nvidia.com/XFree86/Linux-x86_64
DRIVER_VERSION=470.199.02
curl -fsSL -O $BASE_URL/$DRIVER_VERSION/NVIDIA-Linux-x86_64-$DRIVER_VERSION.run

sh ./NVIDIA-Linux-x86_64-$DRIVER_VERSION.run

# install toolkit

wget https://developer.download.nvidia.com/compute/cuda/11.2.1/local_installers/cuda_11.2.1_460.32.03_linux.run

sh cuda_11.2.1_460.32.03_linux.run

# prompts

│ CUDA Installer                                                               │
│ - [X] Driver                                                                 │
│      [X] 460.32.03                                                           │
│ + [X] CUDA Toolkit 11.2                                                      │
│   [ ] CUDA Samples 11.2                                                      │
│   [ ] CUDA Demo Suite 11.2                                                   │
│   [ ] CUDA Documentation 11.2                                                │
│   Options                                                                    │
│   Install    

# update the /usr/local/cuda symlink to this version: yes
# interactive install, i.e. no -silent -driver ...

===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-11.2/
Samples:  Not Selected

Please make sure that
 -   PATH includes /usr/local/cuda-11.2/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-11.2/lib64, or, add /usr/local/cuda-11.2/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-11.2/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Logfile is /var/log/cuda-installer.log
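A quick sanity check after the installer finishes, verifying the toolkit location and the /usr/local/cuda symlink the installer maintains (paths per the summary above):

# toolkit present and symlink pointing at 11.2?
ls -ld /usr/local/cuda /usr/local/cuda-11.2
/usr/local/cuda/bin/nvcc --version

# driver version as seen by the kernel module
cat /proc/driver/nvidia/version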


# no nvidia_modprobe? check the device files: ls -l /dev/nvidia*

reboot
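After the reboot, verify that the kernel module loaded and the device nodes were created (this answers the nvidia_modprobe question above):

# kernel module loaded?
lsmod | grep ^nvidia

# device nodes present? expect nvidia0..nvidia3 plus nvidiactl on a 4-gpu node
ls -l /dev/nvidia*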

# then

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
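To make this persistent for all users instead of per-shell exports, a sketch following the installer's suggestion above; the file names under /etc are our own choice:

# system-wide PATH/LD_LIBRARY_PATH (file name is arbitrary)
cat > /etc/profile.d/cuda.sh << 'EOF'
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
EOF

# or register the libraries with the dynamic linker instead
echo /usr/local/cuda/lib64 > /etc/ld.so.conf.d/cuda.conf
ldconfig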

[hmeij@n33 ~]$ nvidia-smi
Tue Sep  5 14:43:15 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K20m          On   | 00000000:02:00.0 Off |                    0 |
| N/A   26C    P8    25W / 225W |      0MiB /  4743MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m          On   | 00000000:03:00.0 Off |                    0 |
| N/A   27C    P8    26W / 225W |      0MiB /  4743MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20m          On   | 00000000:83:00.0 Off |                    0 |
| N/A   25C    P8    24W / 225W |      0MiB /  4743MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K20m          On   | 00000000:84:00.0 Off |                    0 |
| N/A   26C    P8    25W / 225W |      0MiB /  4743MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

# startup slurm, finds gpus? yes
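Slurm finds the GPUs through its GRES setup; a hypothetical gres.conf entry consistent with the slurmd.log line quoted further down (our actual config may differ):

# hypothetical /etc/slurm/gres.conf entry for the mwgpu nodes
NodeName=n[33-37] Name=gpu Type=tesla_k20m File=/dev/nvidia[0-3]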

# old compiled code compatible? test

Testing

Script ~hmeij/slurm/run.centos runs pmemd.cuda from a local Amber20 install against CUDA 11.2, with

  • #SBATCH -N 1
  • #SBATCH -n 1
  • #SBATCH -B 1:1:1
  • #SBATCH --mem-per-gpu=7168

For some reason this yields cpus=8 rather than the expected cpus=1: Slurm overrides the settings above with the partition's DefCpuPerGPU=8. Slurm has not changed but the CUDA version has, which is odd. The good news is that Amber runs fine, so no recompile is needed.
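The cpus=8 comes straight from the partition definition; a hypothetical slurm.conf line of the kind that would produce this behavior (the actual entry may differ):

# hypothetical slurm.conf partition entry; DefCpuPerGPU=8 yields cpus=8 per single-gpu job
PartitionName=mwgpu Nodes=n[33-37] DefCpuPerGPU=8 State=UP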

# from slurmd.log
[2023-09-05T14:51:00.691] Gres Name=gpu Type=tesla_k20m Count=4

JOBID   PARTITION         NAME          USER  ST          TIME NODES  CPUS    MIN_MEMORY NODELIST(REASON)
1053052 mwgpu             test         hmeij   R          0:09     1     8             0              n33

[hmeij@cottontail2 slurm]$ ssh n33 gpu-info
id,name,temp.gpu,mem.used,mem.free,util.gpu,util.mem
0, Tesla K20m, 36, 95 MiB, 4648 MiB, 100 %, 25 %
1, Tesla K20m, 26, 0 MiB, 4743 MiB, 0 %, 0 %
2, Tesla K20m, 25, 0 MiB, 4743 MiB, 0 %, 0 %
3, Tesla K20m, 26, 0 MiB, 4743 MiB, 0 %, 0 %

[hmeij@cottontail2 slurm]$ ssh n33 gpu-process
gpu_name, gpu_id, pid, process_name
Tesla K20m, 0, 28394, pmemd.cuda
  • #SBATCH --cpus-per-gpu=1

Adding this option does force Slurm to allocate just a single CPU. Now try 4 GPU jobs per node; no CUDA_VISIBLE_DEVICES setting is needed.
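Putting the options together, the relevant header of the test script looks roughly like this; the partition and gres lines are assumptions, the rest comes from the options listed above:

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -B 1:1:1
#SBATCH --mem-per-gpu=7168
#SBATCH --cpus-per-gpu=1
#SBATCH --gres=gpu:1      # assumed: one K20m per job
#SBATCH -p mwgpu          # assumed: mwgpu partition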

JOBID   PARTITION         NAME          USER  ST          TIME NODES  CPUS    MIN_MEMORY NODELIST(REASON)
1053992 mwgpu             test         hmeij   R          0:04     1     1             0              n33

[hmeij@cottontail2 slurm]$ for i in `seq 1 6`; do sbatch run.centos; sleep 30; squeue | grep hmeij; done

# output
Submitted batch job 1054000
1054000 mwgpu             test         hmeij   R          0:30     1     1             0              n33
Submitted batch job 1054001
1054001 mwgpu             test         hmeij   R          0:30     1     1             0              n33
1054000 mwgpu             test         hmeij   R          1:00     1     1             0              n33
Submitted batch job 1054002
1054002 mwgpu             test         hmeij   R          0:30     1     1             0              n33
1054001 mwgpu             test         hmeij   R          1:00     1     1             0              n33
1054000 mwgpu             test         hmeij   R          1:30     1     1             0              n33
Submitted batch job 1054003
1054003 mwgpu             test         hmeij   R          0:30     1     1             0              n33
1054002 mwgpu             test         hmeij   R          1:00     1     1             0              n33
1054001 mwgpu             test         hmeij   R          1:30     1     1             0              n33
1054000 mwgpu             test         hmeij   R          2:00     1     1             0              n33
Submitted batch job 1054004
1054004 mwgpu             test         hmeij  PD          0:00     1     1             0      (Resources)
1054003 mwgpu             test         hmeij   R          1:00     1     1             0              n33
1054002 mwgpu             test         hmeij   R          1:30     1     1             0              n33
1054001 mwgpu             test         hmeij   R          2:00     1     1             0              n33
1054000 mwgpu             test         hmeij   R          2:30     1     1             0              n33
Submitted batch job 1054005
1054005 mwgpu             test         hmeij  PD          0:00     1     1             0(Nodes required f
1054004 mwgpu             test         hmeij  PD          0:00     1     1             0      (Resources)
1054003 mwgpu             test         hmeij   R          1:30     1     1             0              n33
1054002 mwgpu             test         hmeij   R          2:00     1     1             0              n33
1054001 mwgpu             test         hmeij   R          2:30     1     1             0              n33
1054000 mwgpu             test         hmeij   R          3:00     1     1             0              n33
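With four jobs running and two pending, the node should show all four GPUs allocated; a quick check from the head node (output omitted):

# gres configuration and current allocation on n33
scontrol show node n33 | grep -iE 'gres|tres'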


[hmeij@cottontail2 slurm]$ ssh n33 gpu-info
id,name,temp.gpu,mem.used,mem.free,util.gpu,util.mem
0, Tesla K20m, 40, 95 MiB, 4648 MiB, 100 %, 25 %
1, Tesla K20m, 40, 95 MiB, 4648 MiB, 94 %, 23 %
2, Tesla K20m, 35, 95 MiB, 4648 MiB, 93 %, 21 %
3, Tesla K20m, 28, 95 MiB, 4648 MiB, 97 %, 25 %

Other software does need to be recompiled because it links against specific library versions rather than the generic libName.so (LAMMPS, for example).

Script ~hmeij/slurm/run.centos.lammps sets up the environment and prints the LAMMPS help page.
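A minimal sketch of the environment setup that produces the tool paths listed below, assuming these install prefixes (the actual script may differ):

# toolchain environment (prefixes taken from the which/ldd output below)
export PATH=/share/apps/intel/parallel_studio_xe_2016_update3/compilers_and_libraries_2016.3.210/linux/bin/intel64:$PATH
export PATH=/usr/local/cuda/bin:/share/apps/CENTOS7/openmpi/4.0.4/bin:/share/apps/CENTOS7/python/3.8.3/bin:$PATH
export PATH=/share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/share/apps/CENTOS7/openmpi/4.0.4/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS7/gcc/6.5.0/lib64:$LD_LIBRARY_PATH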

/share/apps/intel/parallel_studio_xe_2016_update3/compilers_and_libraries_2016.3.210/linux/bin/intel64/ifort
/usr/local/cuda/bin/nvcc
/share/apps/CENTOS7/openmpi/4.0.4/bin/mpirun
/share/apps/CENTOS7/python/3.8.3/bin/python
/share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/lmp_mpi-cuda-single-single
        linux-vdso.so.1 =>  (0x00007ffd714ec000)
        libjpeg.so.62 => /lib64/libjpeg.so.62 (0x00007fe443b9a000)
        libcudart.so.11.0 => /usr/local/cuda/lib64/libcudart.so.11.0 (0x00007fe44390b000)
        libcuda.so.1 => /lib64/libcuda.so.1 (0x00007fe442223000)
        libcufft.so.10 => /usr/local/cuda/lib64/libcufft.so.10 (0x00007fe436a74000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fe436870000)
        libmpi.so.40 => /share/apps/CENTOS7/openmpi/4.0.4/lib/libmpi.so.40 (0x00007fe43655b000)
        libstdc++.so.6 => /share/apps/CENTOS7/gcc/6.5.0/lib64/libstdc++.so.6 (0x00007fe4361d9000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fe435ed7000)
        libgcc_s.so.1 => /share/apps/CENTOS7/gcc/6.5.0/lib64/libgcc_s.so.1 (0x00007fe435cc0000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fe435aa4000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fe4356d6000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fe443def000)
        librt.so.1 => /lib64/librt.so.1 (0x00007fe4354ce000)
        libopen-rte.so.40 => /share/apps/CENTOS7/openmpi/4.0.4/lib/libopen-rte.so.40 (0x00007fe435218000)
        libopen-pal.so.40 => /share/apps/CENTOS7/openmpi/4.0.4/lib/libopen-pal.so.40 (0x00007fe434f09000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00007fe434d06000)
        libz.so.1 => /lib64/libz.so.1 (0x00007fe434af0000)

Large-scale Atomic/Molecular Massively Parallel Simulator - 28 Mar 2023 - Development

Usage example: lmp_mpi-cuda-single-single -var t 300 -echo screen -in in.alloy

List of command line options supported by this LAMMPS executable:
<snip>

# hmmm, using -suffix gpu it does not jump on the gpus; generic non-gpu libthread error
# the same lammps version works on rocky8/cuda-11.6 and on centos7/cuda-10.2, all "make" compiles
# so try a "cmake" compile on n33-n36
# the libpace tarball download fails on its file hash and
# yields a  status: [1;"Unsupported protocol"]  error for ML-PACE

# without ML-PACE the hash check also fails for the opencl-loader third party package, bad url
# https://download.lammps.org/thirdparty/opencl-loader-version...tgz
# fetched it manually, then extracted it in the _deps/ dir
# and added -D GPU_LIBRARY=../lib/gpu/libgpu.a ala QUIP_LIBRARY
# that works, the cmake-compiled binary jumps on multiple gpus
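For reference, a hedged sketch of the cmake configuration used; only the GPU_LIBRARY workaround is taken from the notes above, the other options (sm_35 matches the K20m, single precision matches the single-single build name) are assumptions:

# from the lammps source tree (sketch only)
mkdir build-cuda-11.2 && cd build-cuda-11.2
cmake ../cmake \
  -D CMAKE_INSTALL_PREFIX=/share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/cmake/single-single \
  -D BUILD_MPI=on \
  -D PKG_GPU=on -D GPU_API=cuda -D GPU_ARCH=sm_35 -D GPU_PREC=single \
  -D GPU_LIBRARY=../lib/gpu/libgpu.a
make -j 8 && make install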


[hmeij@n35 sharptail]$ mpirun -n 2 \
/share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/cmake/single-single/lmp \
-suffix gpu -in in.colloid 

[root@greentail52 ~]# ssh n35 gpu-process
gpu_name, gpu_id, pid, process_name
Tesla K20m, 0, 9911, /share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/cmake/single-single/lmp
Tesla K20m, 1, 9912, /share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/cmake/single-single/lmp

# some stats, colloid example

1 cpu, 1 gpu
Total wall time: 0:05:49
2 cpus, 2 gpus
Total wall time: 0:03:58
4 cpus, 4 gpus
Total wall time: 0:02:23
8 cpus, 4 gpus
Total wall time: 0:02:23
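The multi-gpu rows above were launched the same way as the 2-gpu example, roughly as follows; the -pk gpu 4 switch (requesting all four devices) is our assumption:

# 4 cpus / 4 gpus variant of the colloid run (sketch)
mpirun -n 4 \
/share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/cmake/single-single/lmp \
-suffix gpu -pk gpu 4 -in in.colloid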

# but the ML-PACE hash error is different, so no go there

