\\ **[[cluster:0|Back]]**

Slurm and NGC container links:

  * https://slurm.schedmd.com/SLUG19/NVIDIA_Containers.pdf
  * https://devblogs.nvidia.com/how-to-run-ngc-deep-learning-containers-with-singularity/
  * https://devblogs.nvidia.com/automating-downloads-ngc-container-replicator/
  * https://devblogs.nvidia.com/docker-compatibility-singularity-hpc/

Other useful links:

  * https://www.nvidia.com/en-us/gpu-cloud/containers/
  * https://docs.nvidia.com/ngc/ngc-user-guide/index.html
  * scheduler wrapper, inside container: NV_GPU=2,3 nvidia-docker run ...
  * (container sees host GPUs 2,3 as container GPUs 0,1)
  * https://ngc.nvidia.com/catalog/containers/
  * https://blog.exxactcorp.com/installing-using-docker-nv-docker-centos-7/
  * https://github.com/nvidia/nvidia-container-runtime#nvidia_visible_devices
  * NVIDIA_VISIBLE_DEVICES or CUDA_VISIBLE_DEVICES?

==== NGC Docker Containers ====

Trying to understand how to leverage GPU-ready applications on the NVIDIA NGC web site (NVIDIA GPU Cloud): download Docker containers, build your own on-premise catalog, and run GPU-ready software on compute nodes in Docker containers. I can't wrap my head around how to integrate containers with our scheduler yet (see the sketches further down this page).

  * https://blog.exxactcorp.com/installing-using-docker-nv-docker-centos-7/

<code bash>
# get docker on centos 7
curl -fsSL https://get.docker.com/ -o get-docker.sh
sh get-docker.sh

# enable and start the service
systemctl enable docker
systemctl start docker

# add a docker user
adduser dockeruser
usermod -aG docker dockeruser

# get nvidia-docker
#wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker-1.0.1-1.x86_64.rpm
wget https://github.com/NVIDIA/nvidia-docker/archive/v2.2.2.tar.gz
#rpm -i /tmp/nvidia-docker*.rpm
#make nvidia-docker

# enable and start the service, then check both
systemctl enable nvidia-docker
systemctl start nvidia-docker
systemctl status docker
systemctl status nvidia-docker

# fetch image and run a command in a container,
# then remove the container; the image remains
nvidia-docker run --rm nvidia/cuda nvidia-smi
# or
docker pull nvidia/cuda
</code>

Pull down other containers, for example from the NGC container registry (nvcr.io):

  * https://ngc.nvidia.com/catalog/containers?orderBy=modifiedDESC&query=&quickFilter=containers&filters=

NGC deep learning ready Docker containers:

  * NVIDIA DIGITS - nvcr.io/nvidia/digits
  * TensorFlow - nvcr.io/nvidia/tensorflow
  * Caffe - nvcr.io/nvidia/caffe
  * NVIDIA CUDA - nvcr.io/nvidia/cuda (9.2, 10.1, 10.0)
  * PyTorch - nvcr.io/nvidia/pytorch
  * RapidsAI - nvcr.io/nvidia/rapidsai/rapidsai

Additional Docker images:

  * Portainer Docker management - portainer/portainer

<code bash>
# in the catalog you can also find
docker pull nvcr.io/hpc/gromacs:2018.2
docker pull nvcr.io/hpc/lammps:24Oct2018
docker pull nvcr.io/hpc/namd:2.13-multinode
docker pull nvcr.io/partners/matlab:r2019b
# not all are at the latest versions, as you can see,
# and amber would have to be custom built on top of nvidia/cuda
</code>

Make GPUs available to the container and apply some settings:

<code bash>
# DIGITS example
# if you passed host GPU IDs 2,3 the container would still see the GPUs as IDs 0,1
NV_GPU=0,1 nvidia-docker run --name digits -d -p 5000:5000 nvidia/digits

# list running containers
nvidia-docker ps
</code>
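A minimal, untested sketch of the GPU remapping described above: the first form assumes nvidia-docker v1 (NV_GPU), the second assumes nvidia-docker v2 with the NVIDIA container runtime and NVIDIA_VISIBLE_DEVICES (see the nvidia-container-runtime link at the top of the page).

<code bash>
# sketch only: expose host GPUs 2 and 3 to a container
# with nvidia-docker v1, NV_GPU picks the host GPUs; inside the
# container nvidia-smi should report them as GPU 0 and GPU 1
NV_GPU=2,3 nvidia-docker run --rm nvidia/cuda nvidia-smi

# roughly the same with nvidia-docker v2 / the nvidia container runtime,
# using NVIDIA_VISIBLE_DEVICES instead of NV_GPU
docker run --rm --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=2,3 nvidia/cuda nvidia-smi
</code>

Either way the container numbers the GPUs it can see starting at 0, which is why host GPUs 2,3 show up as 0,1 inside.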
There are some other issues:

  * inside the container the user-invoked application runs as root, so copying files back and forth is a problem
  * file systems, home directories and scratch spaces need to be mounted inside the container
  * GPUs need to be reserved via the scheduler on a host and then made available to the container (see above, and the sketch at the end of this page)

Some notes from https://docs.nvidia.com/ngc/ngc-user-guide/index.html

<code bash>
# NGC containers are hosted in a registry called nvcr.io
# a Docker container is the running instance of a Docker image
# all NGC container images are based on the CUDA platform layer (nvcr.io/nvidia/cuda)

# mount a host directory onto a container location
-v $HOME:/tmp/$USER

# pull images
docker pull nvcr.io/hpc/namd:2.1
docker images

# detailed information about a container
/workspace/README.md

# specify a user
-u $(id -u):$(id -g)

# allocate GPUs
NV_GPU=0,1 nvidia-docker run ...

# custom built images ...
# looks complex, based on Dockerfile config file commands
# see link
</code>
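Putting the notes above together, a rough, untested sketch of what a scheduler-driven container job could look like. Assumptions: Slurm is configured with gres/gpu and exports CUDA_VISIBLE_DEVICES for the allocated GPUs, nvidia-docker v2 (--runtime=nvidia) is installed on the compute node, the node is already logged in to nvcr.io, and the TensorFlow image tag is only an example.

<code bash>
#!/bin/bash
#SBATCH --job-name=ngc-test
#SBATCH --gres=gpu:2

# sketch: run an NGC container as the submitting user, mount $HOME,
# and hand the scheduler-assigned GPUs to the container
# (assumes Slurm sets CUDA_VISIBLE_DEVICES; passing it on as
#  NVIDIA_VISIBLE_DEVICES like this is untested)
docker run --rm --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES} \
  -u $(id -u):$(id -g) \
  -v $HOME:/tmp/$USER \
  nvcr.io/nvidia/tensorflow:19.10-py3 \
  python -c 'import tensorflow as tf; print(tf.__version__)'
</code>

\\ **[[cluster:0|Back]]**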