cluster:193 [DokuWiki]


Differences

This shows you the differences between two versions of the page.

cluster:193 [2020/03/02 16:14]
hmeij07
cluster:193 [2020/03/10 08:07]
hmeij07
Line 2: Line 2:
 **[[cluster:0|Back]]**
  
-==== Docker Usage ====+===== Docker Containers Usage =====
  
-=== Simple Runs ===+Page build up from the bottom to top.  We're not making a traditional "MPI" docker integration with our scheduler.  We'll see what usage patterns will emergence and go from there. I can help with workflow.  If more containers are desired please let me know which ones to ''pull''
 + 
 +If users want to run web-enabled applications in the container, one simple workflow would be to submit a job that reserves a GPU and then loops checking a lock file until it is removed.  Then ssh to the node and access the application via ''firefox http://localhost:PORT/''. For example, DIGITS and JupyterLab, described below. 
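That lock-file workflow can be sketched as a job skeleton; the lock path, sleep interval, and the `hold_until_released` helper name are illustrative, not an existing script on the cluster:

```shell
#!/bin/bash
# Hypothetical "hold a GPU until released" job skeleton.
# Queue/resource names follow the exx96 examples on this page;
# the lock-file path is an assumption.
#BSUB -q exx96
#BSUB -R "rusage[gpu4=1]"

LOCK="${LOCK:-$HOME/.gpu_session.lock}"

hold_until_released() {
    # create the lock, then block until the user removes it
    local interval="${1:-60}"
    touch "$LOCK"
    while [ -e "$LOCK" ]; do
        sleep "$interval"
    done
}

# start the web application here (e.g. DIGITS or JupyterLab, see below),
# then call: hold_until_released
```

While the job holds the reservation, ssh to the node and browse ''http://localhost:PORT/''; removing the lock file lets the job finish and frees the GPU.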
 + 
 +==== Readings ==== 
 + 
 +Interesting reads... 
 + 
 +  * https://www.stackhpc.com/k8s-mpi.html 
 + 
 +  * https://www.stackhpc.com/the-state-of-hpc-containers.html 
 +  
 +==== Scheduler Runs ==== 
 + 
 +Next, add the scheduler syntax that is needed to the script. Request a gpu resource ''gpu4'' and memory, then submit jobs. An example is located at ''/home/hmeij/jobs/docker/run.docker''. Notice that we use ''nvidia-docker'' [[https://thenewstack.io/primer-nvidia-docker-containers-meet-gpus/|External Link]].  Nvidia-Docker is basically a wrapper around the docker CLI that transparently provisions a container with the necessary dependencies to execute code on the GPU. 
 + 
 +<code> 
 + 
 +#!/bin/bash 
 +# submit via 'bsub < run.docker' 
 +# note: #BSUB options must precede the first executable line 
 +#BSUB -e err 
 +#BSUB -o out 
 +#BSUB -q exx96 
 +#BSUB -J "RTX2080S docker" 
 +#BSUB -n 1 
 +#BSUB -R "rusage[gpu4=1:mem=6288],span[hosts=1]" 
 + 
 +rm -f out err 
 + 
 +# should add a check we get an integer back in the 0-3 range 
 +gpuid=$(gpu-free | sed "s/,/\n/g" | shuf | head -1) 
 +echo ""; echo "running on gpu $HOSTNAME:$gpuid"; echo "" 
 + 
 +NV_GPU=$gpuid \ 
 +nvidia-docker run --rm -u $(id -u):$(id -g) \ 
 +-v /$HOME:/mnt/$USER \ 
 +-v /home/apps:/mnt/apps \ 
 +-v /usr/local:/mnt/local \ 
 +nvcr.io/nvidia/tensorflow:19.09-py2 python \ 
 +/mnt/$USER/jobs/docker/benchmarks-master/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \ 
 +--num_gpus=1 --batch_size=64 \ 
 +--model=resnet50 \ 
 +--variable_update=parameter_server 
 +# or run_tests.py 
 + 
 +</code> 
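The script's comment notes that the ''gpu-free'' result should be validated. A minimal sketch, assuming ''gpu-free'' prints a comma-separated list of free ids (e.g. ''0,2,3''); the ''pick_gpu'' helper name is illustrative:

```shell
# pick_gpu: choose a random free GPU id and validate it.
# Assumes `gpu-free` prints a comma-separated list of ids, e.g. "0,2,3".
pick_gpu() {
    local id
    id=$(gpu-free | tr ',' '\n' | shuf -n 1)
    case "$id" in
        [0-3]) echo "$id" ;;                          # valid physical gpu id
        *) echo "unexpected gpu id: '$id'" >&2; return 1 ;;
    esac
}
```

The bare assignment in the script then becomes ''gpuid=$(pick_gpu) || exit 1'', so the job fails early instead of launching the container with an empty or garbled ''NV_GPU''.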
 + 
 +To make the ''imports'' work, edit that Python file: 
 + 
 +<code> 
 + 
 +import sys 
 +sys.path.insert(0, '/mnt/hmeij/jobs/docker/benchmarks-master/scripts/tf_cnn_benchmarks/') 
 + 
 +</code> 
 + 
 +==== GPU Runs ==== 
 + 
 +We put the tensorflow benchmark example in a script. It will find a free gpu, set it in an environment variable ''NV_GPU'', and then run the tensorflow application on the allocated GPU. 
 + 
 +<code> 
 + 
 +# execute script 
 +[hmeij@n79 docker]$ ./run.docker 
 + 
 +running on gpu n79:3   <--------- this is the physical gpu id 
 +  
 +# tensorflow starts up                                                                                                                                                                                                                                                
 +================                                                                                                         
 +== TensorFlow ==                                                                                                         
 +================                                                                                                         
 + 
 +NVIDIA Release 19.09 (build 8044706) 
 +TensorFlow Version 1.14.0 
 + 
 +Container image Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved. 
 +Copyright 2017-2019 The TensorFlow Authors.  All rights reserved. 
 +(snip output...) 
 + 
 +# details 
 +TensorFlow:  1.14 
 +Model:       resnet50 
 +Dataset:     imagenet (synthetic) 
 +Mode:        training 
 +SingleSess:  False 
 +Batch size:  64 global 
 +             64 per device 
 +Num batches: 100 
 +Num epochs:  0.00 
 +Devices:     ['/gpu:0']  <--------- this is the logical gpu id 
 +NUMA bind:   False 
 +Data format: NCHW 
 +Optimizer:   sgd 
 +Variables:   parameter_server 
 +========== 
 +Generating training model 
 +Initializing graph 
 +Running warm up 
 +(snip output...) 
 + 
 +# query what is running on gpus ... D8 is gpu 3 (ssh n79 nvidia-smi to verify) 
 +[root@n79 ~]# gpu-process 
 + 
 +gpu_name, gpu_bus_id, pid, process_name 
 +GeForce RTX 2080 SUPER, 00000000:D8:00.0, 77747, python 
 + 
 + 
 +</code> 
 + 
 + 
 +==== Simple Runs ==== 
 + 
 +Some simple interactive test runs. Map your home directory from the host inside the container; I chose /mnt but it can go anywhere except /home ... Also set your uid/gid, because otherwise the container will run as "root" in the user namespace.
  
 <code>
Line 11: Line 123:
 [hmeij@n79 ~]$ nvidia-docker run --rm -v /$HOME:/mnt/$USER \
 -u $(id -u):$(id -g) nvcr.io/nvidia/cuda:10.0-devel id
 +
 uid=8216 gid=623 groups=623
  
Line 16: Line 129:
 [hmeij@n79 ~]$ nvidia-docker run --rm -v /$HOME:/mnt/$USER \
 -u $(id -u):$(id -g) nvcr.io/nvidia/cuda:10.0-devel df /mnt/hmeij
 +
 Filesystem                 1K-blocks       Used  Available Use% Mounted on
 10.10.102.42:/home/hmeij 10735331328 8764345344 1970985984  82% /mnt/hmeij
Line 24: Line 138:
 touch /mnt/$USER/tmp/dockerout.stuff
  
-# check permissions on container host+# check permissions on host running the container
 [hmeij@n79 ~]$  ls -l $HOME/tmp
 total 232
Line 31: Line 145:
 </code>
  
-=== Pull Images ===+==== Pull Images ====
 + 
 +Pull more images from the NVIDIA GPU Cloud Catalog.  There are also models.  As you can tell, not all container applications are up to date.  Only pulled on node ''n79'', not expecting any usage.  It is nice to pull esoteric software like the deep learning stack (digits, tensorflow, pytorch, caffe, rapidsai).
  
 <code>
Line 67: Line 183:
 </code>
  
-=== JupyterLab ===+==== JupyterLab ====
  
 The Rapids container and Notebook Server hide in the ''rapidsai:cuda10.0'' container, also an interactive browser enabled application. Again, not sure how we would use it but here is how to start it. JupyterLab is a next-generation web-based user interface for Project Jupyter. [[https://jupyterlab.readthedocs.io/en/stable/getting_started/overview.html|External Link]]. Project Jupyter exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages. [[https://jupyter.org/|External Link]]
Line 97: Line 213:
  
  
-=== Digits ===+==== Digits ====
  
  
Line 107: Line 223:
 DATE=$( date +%N )  # nanoseconds unique id
  
-docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES="0,1,2,3" --name digits-$DATE-0 -d -p 5000:5000 -v /data/datasets:/opt/datasets --restart=always nvcr.io/nvidia/digits:19.09-tensorflow+docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES="0,1,2,3" \
 +--name digits-$DATE-0 -d -p 5000:5000 \
 +-v /data/datasets:/opt/datasets --restart=always \
 +nvcr.io/nvidia/digits:19.09-tensorflow
  
  
Line 138: Line 257:
 {{:cluster:digits.png?nolink&800|}}
  
-=== Portainer ===+==== Portainer ====
  
 Portainer is a simple management solution for Docker which allows browser access on http://localhost:9000 with admin credentials. We'll probably never use this but use the command line instead (docker ps, docker kill, etc). 
Line 145: Line 264:
  
 DATE=$(date +%N)  # nanoseconds as unique id
-docker run --name portainer-$DATE-0 -d -p 9000:9000 -v "/var/run/docker.sock:/var/run/docker.sock" --restart=always portainer/portainer+docker run --name portainer-$DATE-0 -d -p 9000:9000 \
 +-v "/var/run/docker.sock:/var/run/docker.sock" --restart=always \
 +portainer/portainer
  
 </code>
Line 154: Line 275:
  
  
-=== What's Running? ===+==== What's Running? ====
  
   * Docker Version 19.03.5
   * NVIDIA-Docker2 2.2.2-1
  
-Video was "onboard" which means Gnome/X jump on first GPU scanned in pci bus (not always the same one!). Fater configuration we set ''multi-user.target'' so we have video on VGA rather than GPU. The BIOS onboard video enabled setting was not changed, for now.+Video was "onboard" which means Gnome/X jumps on the first GPU scanned on the PCI bus (not always the same one!). We set ''multi-user.target'' so we have video on VGA rather than a GPU. The BIOS onboard video enabled setting was not changed, for now.
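The ''multi-user.target'' change mentioned above is the standard systemd default-target switch; as a sketch (not captured from these nodes):

```
# boot to the text console so X/Gnome does not claim a GPU
systemctl set-default multi-user.target
# verify the new default
systemctl get-default
```

The change takes effect on the next boot; ''systemctl isolate multi-user.target'' applies it immediately on a running node.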
  
 <code>
Line 199: Line 320:
  
  
-=== Setup ===+==== Setup ====
  
 For a more detailed read on how to install Docker consult [[cluster:187|NGC Docker Containers]]. Exxact delivered these nodes preconfigured with Deep Learning Software Stack. This page details what those nodes look like.
cluster/193.txt · Last modified: 2020/03/10 08:31 by hmeij07