\\
**[[cluster:0|Back]]**
===== Docker Containers Usage =====
This page is built up from the bottom to the top. We're not making a traditional "MPI" docker integration with our scheduler. We'll see what usage patterns emerge and go from there. I can help with workflow. If more containers are desired please let me know which ones to ''pull''.
If users want to run web-enabled applications in a container, one simple workflow is to submit a job that reserves a GPU and then loops, checking for a lock file until that file is removed. Then ssh to the node and access the application via ''firefox http://localhost:PORT/''. Examples are DIGITS and JupyterLab, described below.
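A minimal sketch of the lock-file loop described above, meant for the end of a job script. The function name ''wait_for_lock'', the lock file location and the poll interval are illustrative assumptions, not an existing site convention.

```shell
#!/bin/bash
# Sketch: keep a job (and its reserved GPU) alive until a lock file is removed.
# wait_for_lock, the lock path and the poll interval are hypothetical choices.

wait_for_lock() {
    local lockfile="$1"        # file whose removal ends the wait
    local interval="${2:-60}"  # seconds between checks, default 60
    while [ -e "$lockfile" ]; do
        sleep "$interval"
    done
}

# Example use inside a job script (commented out so sourcing has no side effects):
#   lock="$HOME/jobs/docker/web.lock"    # hypothetical location
#   touch "$lock"
#   ... start the web-enabled container here ...
#   wait_for_lock "$lock" 60             # 'rm' the lock file to end the job
```

Removing the lock file from any login session lets the loop fall through, so the job ends and the scheduler frees the GPU.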
==== Readings ====
Interesting reads...
* https://www.stackhpc.com/k8s-mpi.html
* PMI(x), Slurm
* https://www.stackhpc.com/the-state-of-hpc-containers.html
* Docker, Kubernetes, Singularity, Shifter, Charliecloud
* https://en.wikipedia.org/wiki/HAProxy
* HA load balancing with Docker images for CentOS
==== Scheduler Runs ====
Next, add the required scheduler syntax to the script. Request a GPU resource (''gpu4'') and memory, then submit the job. An example is located at ''/home/hmeij/jobs/docker/run.docker''. Notice that we use ''nvidia-docker'' [[https://thenewstack.io/primer-nvidia-docker-containers-meet-gpus/|External Link]]. Nvidia-Docker is basically a wrapper around the docker CLI that transparently provisions a container with the necessary dependencies to execute code on the GPU.
#!/bin/bash
# submit via 'bsub < run.docker'
rm -f out err
#BSUB -e err
#BSUB -o out
#BSUB -q exx96
#BSUB -J "RTX2080S docker"
#BSUB -n 1
#BSUB -R "rusage[gpu4=1:mem=6288],span[hosts=1]"
# should add a check we get an integer back in the 0-3 range
gpuid="$(gpu-free | sed 's/,/\n/g' | shuf | head -1)"
echo ""; echo "running on gpu $HOSTNAME:$gpuid"; echo ""
NV_GPU=$gpuid \
nvidia-docker run --rm -u $(id -u):$(id -g) \
-v /$HOME:/mnt/$USER \
-v /home/apps:/mnt/apps \
-v /usr/local:/mnt/local \
nvcr.io/nvidia/tensorflow:19.09-py2 python \
/mnt/$USER/jobs/docker/benchmarks-master/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--num_gpus=1 --batch_size=64 \
--model=resnet50 \
--variable_update=parameter_server
# or run_tests.py
To make the ''imports'' work, edit that Python file and add these lines before its other imports.
import sys
sys.path.insert(0, '/mnt/hmeij/jobs/docker/benchmarks-master/scripts/tf_cnn_benchmarks/')
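The job script above carries a comment that the value returned for the GPU id should be checked. A minimal sketch of such a check follows. ''checked_gpuid'' is a hypothetical helper, not part of ''gpu-free'', and the 0-3 range is taken from that script comment.

```shell
#!/bin/bash
# Sketch: accept only a single integer in the 0-3 range as a GPU id.
# checked_gpuid is a hypothetical helper; the range comes from the script comment.

checked_gpuid() {
    # strip stray whitespace (spaces, newlines) before testing the value
    id="$(printf '%s' "$1" | tr -d '[:space:]')"
    case "$id" in
        [0-3]) printf '%s\n' "$id" ;;  # exactly one digit, 0 through 3
        *)     return 1 ;;             # empty, non-numeric or out of range
    esac
}

# Example use right after gpuid has been set in the job script:
#   gpuid="$(checked_gpuid "$gpuid")" || { echo "no usable gpu id" >&2; exit 1; }
```

Exiting early keeps the job from starting a container with an empty or malformed ''NV_GPU'' value.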
==== GPU Runs ====
We put the tensorflow benchmark example in a script. It finds a free GPU, sets it in the environment variable ''NV_GPU'' and then runs the tensorflow application on the allocated GPU.
# execute script
[hmeij@n79 docker]$ ./run.docker
running on gpu n79:3 <--------- this is the physical gpu id
# tensorflow starts up
================
== TensorFlow ==
================
NVIDIA Release 19.09 (build 8044706)
TensorFlow Version 1.14.0
Container image Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
Copyright 2017-2019 The TensorFlow Authors. All rights reserved.
(snip output...)
# details
TensorFlow: 1.14
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 64 global
64 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/gpu:0'] <--------- this is the logical gpu id
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: parameter_server
==========
Generating training model
Initializing graph
Running warm up
(snip output...)
# query what is running on gpus ... D8 is gpu 3 (ssh n79 nvidia-smi to verify)
[root@n79 ~]# gpu-process
gpu_name, gpu_bus_id, pid, process_name
GeForce RTX 2080 SUPER, 00000000:D8:00.0, 77747, python
==== Simple Runs ====
Some simple interactive test runs. Map your home directory from the host into the container. I chose ''/mnt'' but it can go anywhere except ''/home''. Also set your uid/gid with ''-u'', because otherwise the container will run as "root" in user namespace.
# probe uid:gid inside container
[hmeij@n79 ~]$ nvidia-docker run --rm -v /$HOME:/mnt/$USER \
-u $(id -u):$(id -g) nvcr.io/nvidia/cuda:10.0-devel id
uid=8216 gid=623 groups=623
# probe home directory inside container
[hmeij@n79 ~]$ nvidia-docker run --rm -v /$HOME:/mnt/$USER \
-u $(id -u):$(id -g) nvcr.io/nvidia/cuda:10.0-devel df /mnt/hmeij
Filesystem 1K-blocks Used Available Use% Mounted on
10.10.102.42:/home/hmeij 10735331328 8764345344 1970985984 82% /mnt/hmeij
# create a file inside container
[hmeij@n79 ~]$ nvidia-docker run --rm -v /$HOME:/mnt/$USER \
-u $(id -u):$(id -g) nvcr.io/nvidia/cuda:10.0-devel \
touch /mnt/$USER/tmp/dockerout.stuff
# check permissions on host running the container
[hmeij@n79 ~]$ ls -l $HOME/tmp
total 232
-rw-r--r-- 1 hmeij its 0 Mar 2 14:49 dockerout.stuff
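The probes above repeat the same options each time. A minimal sketch of a wrapper that applies them follows. ''run_as_me'' and the ''NVDOCKER'' variable are hypothetical names introduced only for illustration.

```shell
#!/bin/bash
# Sketch: run a command in a container as the calling user, with the
# home directory mapped to /mnt/$USER as in the examples above.
# run_as_me and NVDOCKER are hypothetical names.

NVDOCKER="${NVDOCKER:-nvidia-docker}"   # override to substitute another launcher

run_as_me() {
    # $1 = image, remaining arguments = command to run inside the container
    image="$1"; shift
    "$NVDOCKER" run --rm \
        -u "$(id -u):$(id -g)" \
        -v "$HOME:/mnt/$USER" \
        "$image" "$@"
}

# Example:
#   run_as_me nvcr.io/nvidia/cuda:10.0-devel id
```

Setting ''NVDOCKER=echo'' prints the assembled command line instead of running it, which is a cheap way to inspect the options.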
==== Pull Images ====
Pull more images from the NVIDIA GPU Cloud (NGC) catalog. There are also models. As you can tell, not all container applications are up to date. These were only pulled on node ''n79'', as no usage is expected. It is nice to be able to pull esoteric software like the deep learning stack (digits, tensorflow, pytorch, caffe, rapidsai).
[root@n79 ~]# docker pull nvcr.io/hpc/gromacs:2018.2
2018.2: Pulling from hpc/gromacs
b234f539f7a1: Pull complete
55172d420b43: Pull complete
5ba5bbeb6b91: Pull complete
43ae2841ad7a: Pull complete
f6c9c6de4190: Pull complete
0555b970f65d: Pull complete
864a2d44e3fa: Pull complete
b6ff28d6c105: Pull complete
0c3c54b51c1e: Pull complete
7a424df9f371: Pull complete
27d1015dc8b9: Pull complete
63eaba578222: Pull complete
Digest: sha256:224b235a84516460930c52e398b9c986cc4e4680a03fd9b0880e6801fbe18773
Status: Downloaded newer image for nvcr.io/hpc/gromacs:2018.2
nvcr.io/hpc/gromacs:2018.2
# others
Status: Downloaded newer image for nvcr.io/hpc/lammps:24Oct2018
Status: Downloaded newer image for nvcr.io/hpc/namd:2.13-multinode
# Matlab image pull failed: authorization required
# Also needs a license for the container, skipping.
REPOSITORY TAG IMAGE ID CREATED SIZE
nvcr.io/hpc/namd 2.13-multinode 089737115e76 12 months ago 369MB
nvcr.io/hpc/lammps 24Oct2018 3b02106cf9f3 15 months ago 332MB
nvcr.io/hpc/gromacs 2018.2 0c6acfceb224 19 months ago 1.09GB
==== JupyterLab ====
The Rapids container and Notebook Server hide in the ''rapidsai:cuda10.0'' container, also an interactive browser enabled application. Again, not sure how we would use it but here is how to start it. JupyterLab is a next-generation web-based user interface for Project Jupyter. [[https://jupyterlab.readthedocs.io/en/stable/getting_started/overview.html|External Link]]. Project Jupyter exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages. [[https://jupyter.org/|External Link]]
[hmeij@n79 ~]$ docker run --runtime=nvidia --rm -it \
-p 8888:8888 -p 8887:8887 -p 8886:8886 \
nvcr.io/nvidia/rapidsai/rapidsai:0.9-cuda10.0-runtime-centos7
(rapids) [root@8148861121b3 notebooks]# bash utils/start-jupyter.sh
jupyter-lab --allow-root --ip=0.0.0.0 --no-browser --NotebookApp.token=''
[W 14:52:11.127 LabApp] All authentication is disabled. Anyone who can connect to this server will be able to run code.
[I 14:52:11.127 LabApp] The port 8888 is already in use, trying another port.
[I 14:52:11.595 LabApp] JupyterLab extension loaded from /opt/conda/envs/rapids/lib/python3.6/site-packages/jupyterlab
[I 14:52:11.596 LabApp] JupyterLab application directory is /opt/conda/envs/rapids/share/jupyter/lab
[I 14:52:11.598 LabApp] Serving notebooks from local directory: /rapids/notebooks
[I 14:52:11.598 LabApp] The Jupyter Notebook is running at:
[I 14:52:11.598 LabApp] http://8148861121b3:8889/
[I 14:52:11.598 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Any user can connect once it is started. Here is what it looks like. Note: the ''--rm'' flag means the container is removed when it exits.
{{:cluster:jupyterlab.png?nolink&800|}}
==== Digits ====
The NVIDIA Deep Learning GPU Training System (DIGITS) puts the power of deep learning into the hands of engineers and data scientists. DIGITS can be used to rapidly train highly accurate deep neural networks (DNNs) for image classification, segmentation and object detection tasks. [[https://developer.nvidia.com/digits|External Link]]
DATE=$( date +%N ) # nanoseconds unique id
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES="0,1,2,3" \
--name digits-$DATE-0 -d -p 5000:5000 \
-v /data/datasets:/opt/datasets --restart=always \
nvcr.io/nvidia/digits:19.09-tensorflow
68e539718df3f13c41cdecd3f10e39b5cdecccab7036a64bb5d2ac9cdd133819
[hmeij@n79 ~]$ docker ps -l --no-trunc
CONTAINER ID
IMAGE
COMMAND
CREATED
STATUS
PORTS
NAMES
68e539718df3f13c41cdecd3f10e39b5cdecccab7036a64bb5d2ac9cdd133819
nvcr.io/nvidia/digits:19.09-tensorflow
"/usr/local/bin/nvidia_entrypoint.sh python -m digits"
2 minutes ago
Up About a minute
5000/tcp, 6006/tcp, 6064/tcp, 8888/tcp, 0.0.0.0:5000->5000/tcp
digits-338037152-0
[root@n79 ~]# lsof -i:5000
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
docker-pr 9567 root 4u IPv6 157608 0t0 TCP *:telelpathstart (LISTEN)
Digits looks like this. We'll probably never use it, and it can be restarted when needed. Note the file system mapping of the host's ''/data/datasets'' to the container's ''/opt/datasets'' location. It also grabs all GPUs. Log in as dockuser.
{{:cluster:digits.png?nolink&800|}}
==== Portainer ====
Portainer is a simple management solution for Docker which allows browser access on http://localhost:9000 with admin credentials. We'll probably never use this and will use the command line instead (docker ps, docker kill, etc).
DATE=$(date +%N) # nanoseconds as unique id
docker run --name portainer-$DATE-0 -d -p 9000:9000 \
-v "/var/run/docker.sock:/var/run/docker.sock" --restart=always \
portainer/portainer
Portainer looks like this.
{{:cluster:portainer.png?nolink&800|}}
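For the command-line route, a minimal sketch of listing and stopping containers by name follows. ''running_ids'' and ''stop_matching'' are hypothetical helper names. The ''docker ps -q --filter'' and ''docker stop'' calls are standard Docker CLI, and the name prefixes are the ones used on this page.

```shell
#!/bin/bash
# Sketch: manage containers by name from the command line.
# running_ids and stop_matching are hypothetical helper names.

running_ids() {
    # print the ids of running containers whose name contains $1
    docker ps -q --filter "name=$1"
}

stop_matching() {
    local ids
    ids="$(running_ids "$1")"
    if [ -n "$ids" ]; then
        docker stop $ids    # unquoted on purpose: one argument per id
    else
        echo "no running container matches '$1'" >&2
    fi
}

# Example:
#   running_ids digits-
#   stop_matching portainer-
```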
==== What's Running? ====
* Docker Version 19.03.5
* NVIDIA-Docker2 2.2.2-1
Video was "onboard", which means Gnome/X jumps on the first GPU scanned on the PCI bus (not always the same one!). We set the configuration to ''multi-user.target'' so we have video on VGA rather than on a GPU. The BIOS onboard video enabled setting was not changed, for now.
# pulled images
[hmeij@n79 ~]$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
nvcr.io/nvidia/cuda 10.1-devel 9e47e9dfcb9a 2 months ago 2.83GB
portainer/portainer latest ff4ee4caaa23 2 months ago 81.6MB
nvidia/cuda 9.2-devel 1874839f75d5 3 months ago 2.35GB
nvcr.io/nvidia/cuda 9.2-devel 1874839f75d5 3 months ago 2.35GB
nvcr.io/nvidia/cuda 10.0-devel f765411c4ae6 3 months ago 2.29GB
nvcr.io/nvidia/digits 19.09-tensorflow b08982c9545c 5 months ago 8.85GB
nvcr.io/nvidia/tensorflow 19.09-py2 b82bcb185286 5 months ago 7.88GB
nvcr.io/nvidia/pytorch 19.09-py3 9d6f9ccfbe31 5 months ago 9.15GB
nvcr.io/nvidia/caffe 19.09-py2 b52fbbef7e6b 6 months ago 5.15GB
nvcr.io/nvidia/rapidsai/rapidsai 0.9-cuda10.0-runtime-centos7 22b5dc2f7e84 6 months ago 5.84GB
adding 09/17/2024
https://hub.docker.com/r/mobigroup/pygmtsar-large
docker pull mobigroup/pygmtsar-large
Status: Downloaded newer image for mobigroup/pygmtsar-large:latest
docker.io/mobigroup/pygmtsar-large:latest
# running containers (persistent across boot events)
[hmeij@n79 ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
cd84397823ea nvcr.io/nvidia/digits:19.09-tensorflow "/usr/local/bin/nvid…" 35 minutes ago Up 32 minutes 6006/tcp, 6064/tcp, 0.0.0.0:5000->5000/tcp, 8888/tcp digits-943126479-0
83adc657d558 portainer/portainer "/portainer" 5 weeks ago Up 32 minutes 0.0.0.0:9000->9000/tcp
# to be able to do this as non-root add user to docker group
[root@n79 ~]# ls -l /var/run/docker.sock
srw-rw---- 1 root docker 0 Feb 28 08:38 /var/run/docker.sock
# warning: this is a security risk on "docker host" (can be done via sudo too, or rootless)
# https://docs.docker.com/install/linux/linux-postinstall/
[root@n79 ~]# grep docker /etc/group
dockeruser:x:1001:
docker:x:981:hmeij
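Whether an account already belongs to the ''docker'' group can be checked without root. A minimal sketch follows. ''in_group'' is a hypothetical helper built on the standard ''id'' command, and the group name is taken from the listing above.

```shell
#!/bin/bash
# Sketch: test group membership before trying docker commands as non-root.
# in_group is a hypothetical helper name.

in_group() {
    # $1 = user name, $2 = group name; succeeds if the user is in the group
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF -- "$2"
}

# Example:
#   if in_group "$USER" docker; then
#       docker ps
#   else
#       echo "$USER is not in the docker group" >&2
#   fi
```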
==== Setup ====
For a more detailed read on how to install Docker consult [[cluster:187|NGC Docker Containers]]. Exxact delivered these nodes preconfigured with a Deep Learning Software Stack. This page details what those nodes look like.
\\
**[[cluster:0|Back]]**