\\
**[[cluster:0|Back]]**
Slurm and HPC container links:
* https://slurm.schedmd.com/SLUG19/NVIDIA_Containers.pdf
* https://devblogs.nvidia.com/how-to-run-ngc-deep-learning-containers-with-singularity/
* https://devblogs.nvidia.com/automating-downloads-ngc-container-replicator/
* https://devblogs.nvidia.com/docker-compatibility-singularity-hpc/
Other useful links:
* https://www.nvidia.com/en-us/gpu-cloud/containers/
* https://docs.nvidia.com/ngc/ngc-user-guide/index.html
* scheduler wrapper idea: pass the reserved host GPU IDs to the container, e.g. NV_GPU=2,3 nvidia-docker run ... (the container then sees host GPUs 2,3 as container GPUs 0,1); see the wrapper sketch after this list
* https://ngc.nvidia.com/catalog/containers/
* https://blog.exxactcorp.com/installing-using-docker-nv-docker-centos-7/
* https://github.com/nvidia/nvidia-container-runtime#nvidia_visible_devices
* NVIDIA_VISIBLE_DEVICES (what the runtime exposes to the container) or CUDA_VISIBLE_DEVICES (what CUDA applications inside it use)?
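A minimal sketch of the wrapper idea in the list above, assuming a Slurm-style scheduler that exports the IDs of the reserved GPUs in CUDA_VISIBLE_DEVICES; the options and image are illustrative only, not tested on our setup:
#!/bin/bash
#SBATCH --gres=gpu:2
# hand the host GPU IDs reserved by the scheduler to the container;
# inside the container they show up renumbered starting at 0
NV_GPU=${CUDA_VISIBLE_DEVICES} nvidia-docker run --rm \
  -v $HOME:/tmp/$USER \
  nvidia/cuda nvidia-smi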
==== NGC Docker Containers ====
Trying to understand how to leverage GPU-ready applications from the NVIDIA NGC web site (NVIDIA GPU Cloud): download Docker containers, build our own on-premise catalog, and run GPU-ready software on compute nodes inside those containers. I have not wrapped my head around how to integrate the containers with our scheduler yet.
* https://blog.exxactcorp.com/installing-using-docker-nv-docker-centos-7/
# get docker on centos 7
curl -fsSL https://get.docker.com/ -o get-docker.sh
sh get-docker.sh
# enable and start the docker daemon
systemctl enable docker
systemctl start docker
# create a non-root user and add it to the docker group
adduser dockeruser
usermod -aG docker dockeruser
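# optional sanity check (not in the guide): after dockeruser logs in again so the
# new group membership takes effect, it should be able to run containers without sudo
su - dockeruser -c "docker run --rm hello-world"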
# get nvidia-docker; the old v1 rpm method is kept here commented out
# wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker-1.0.1-1.x86_64.rpm
# rpm -i /tmp/nvidia-docker*.rpm
# for v2, grab the source tarball instead
wget https://github.com/NVIDIA/nvidia-docker/archive/v2.2.2.tar.gz
# make nvidia-docker
# enable and start the nvidia-docker service
systemctl enable nvidia-docker
systemctl start nvidia-docker
systemctl status docker
systemctl status nvidia-docker
# pull the image and run a command in a container; the container is removed
# afterwards (--rm) but the image remains cached
nvidia-docker run --rm nvidia/cuda nvidia-smi
# or just pull the image without running it
docker pull nvidia/cuda
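One way to build the on-premise catalog mentioned above (a sketch only, with a placeholder host name and the registry's default port) is to run a plain Docker registry locally and push the pulled images into it:
# run a local registry container (default port 5000)
docker run -d --name registry -p 5000:5000 registry:2
# re-tag a pulled image and push it into the local catalog
docker tag nvidia/cuda registry-host:5000/nvidia/cuda
docker push registry-host:5000/nvidia/cuda
# an http-only registry must be listed under insecure-registries in /etc/docker/daemon.json
# compute nodes then pull from the local catalog instead of the internet
docker pull registry-host:5000/nvidia/cuda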
Pull down other containers, for example from the NVIDIA GPU Cloud container registry (nvcr.io)
* https://ngc.nvidia.com/catalog/containers?orderBy=modifiedDESC&query=&quickFilter=containers&filters=
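Pulling from nvcr.io generally requires an NGC account and an API key generated on the NGC web site; docker login uses the literal user name $oauthtoken and the API key as the password:
docker login nvcr.io
# Username: $oauthtoken
# Password: <NGC API key>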
NGC Deep Learning Ready Docker Containers:
* NVIDIA DIGITS - nvcr.io/nvidia/digits
* TensorFlow - nvcr.io/nvidia/tensorflow
* Caffe - nvcr.io/nvidia/caffe
* NVIDIA CUDA - nvcr.io/nvidia/cuda (9.2, 10.1, 10.0)
* PyTorch - nvcr.io/nvidia/pytorch
* RapidsAI - nvcr.io/nvidia/rapidsai/rapidsai
Additional Docker Images:
* Portainer Docker Management - portainer/portainer
# the catalog also has HPC application containers, for example
docker pull nvcr.io/hpc/gromacs:2018.2
docker pull nvcr.io/hpc/lammps:24Oct2018
docker pull nvcr.io/hpc/namd:2.13-multinode
docker pull nvcr.io/partners/matlab:r2019b
# not all are at their latest versions, as you can see
# and amber would have to be custom-built on top of nvidia/cuda (see the Dockerfile sketch at the bottom of this page)
Make GPUs available to the container and configure a few options
# DIGITS example
# if you passed host GPU IDs 2,3 the container would still see them as IDs 0,1
NV_GPU=0,1 nvidia-docker run --name digits -d -p 5000:5000 nvidia/digits
# list running containers
nvidia-docker ps
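Since -p 5000:5000 publishes the DIGITS web interface on the host, a quick check from the host itself (port per the command above):
curl -s http://localhost:5000 | head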
There are some other issues (a combined run example follows this list)...
* inside the container the application invoked by the user runs as root, so files copied back and forth end up with root ownership
* file systems, home directories and scratch spaces need to be mounted inside the container
* GPUs need to be reserved via the scheduler on a host and then made available to the container (see above)
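A sketch combining the three points above into a single run; the /scratch path is a placeholder for whatever our nodes use, and the GPU IDs would normally come from the scheduler:
# run as the calling user instead of root, mount home and scratch inside
# the container, and hand it the reserved GPUs
NV_GPU=0,1 nvidia-docker run --rm \
  -u $(id -u):$(id -g) \
  -v $HOME:/tmp/$USER \
  -v /scratch/$USER:/scratch \
  nvidia/cuda nvidia-smi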
Some notes from https://docs.nvidia.com/ngc/ngc-user-guide/index.html
# NGC containers are hosted in a registry called nvcr.io
# A Docker container is the running instance of a Docker image.
# All NGC Container images are based on the CUDA platform layer (nvcr.io/nvidia/cuda)
# mount host directory to container location
-v $HOME:/tmp/$USER
# pull images
docker pull nvcr.io/hpc/namd:2.1
docker images
# detailed information about the container can be found inside it at
/workspace/README.md
# specifying a user
-u $(id -u):$(id -g)
# allocate GPUs
NV_GPU=0,1 nvidia-docker run ...
# custom-built images are defined with a Dockerfile, which looks complex
# see the user guide link above and the sketch below
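A minimal sketch of such a custom build for the Amber case mentioned earlier: write a Dockerfile that starts from a CUDA development image and adds the build on top. The base image tag, package list and build steps are placeholders; Amber's licensed source would have to be supplied and compiled by hand:
# write a throwaway Dockerfile and build a custom image on top of the CUDA layer
cat > Dockerfile <<'EOF'
FROM nvidia/cuda:10.1-devel-centos7
# build tools (placeholder list)
RUN yum install -y gcc gcc-c++ gcc-gfortran make flex tcsh bzip2 && yum clean all
# COPY the application source here and add its configure/compile steps
WORKDIR /workspace
EOF
docker build -t local/amber-cuda .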
\\
**[[cluster:0|Back]]**