cluster:193 (revisions 2020/03/05 13:51 and 2020/03/10 08:31, current, by hmeij07)
**[[cluster:0|Back]]**
  
===== Docker Containers Usage =====
  
Page built up from the bottom to the top.  We're not making a traditional "MPI" docker integration with our scheduler.  We'll see what usage patterns emerge and go from there. I can help with workflow.  If more containers are desired, please let me know which ones to ''pull''.
  
If users want to run web-enabled applications in the container, one simple workflow would be to submit a job that reserves a GPU and then loops checking a lock file until it is removed.  Then ssh to the node and access the application via ''firefox http://localhost:PORT/'', for example DIGITS and JupyterLab described below.
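The lock-file loop described above can be sketched as a small shell function. This is a minimal sketch, not the site's actual script: the lock file name and poll interval are hypothetical choices.

```shell
#!/bin/bash
# minimal sketch of the lock-file reservation loop described above;
# the lock file name and poll interval are hypothetical choices
hold_until_unlocked() {
    local lock=$1 interval=${2:-60}
    touch "$lock"
    echo "holding reservation; rm $lock to release"
    # the job (and its GPU reservation) stays alive while the lock exists
    while [ -f "$lock" ]; do
        sleep "$interval"
    done
    echo "lock removed, releasing"
}

# in a submitted job script one might call, for example:
#   hold_until_unlocked /home/$USER/gpu.lock
# then ssh to the node and point firefox at http://localhost:PORT/
```

While the job idles in this loop the scheduler keeps the GPU slot allocated; removing the lock file lets the job end and frees the resource.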

==== Readings ====

Interesting reads...

  * https://www.stackhpc.com/k8s-mpi.html
    * PMI(x), Slurm

  * https://www.stackhpc.com/the-state-of-hpc-containers.html
    * Docker, Kubernetes, Singularity, Shifter, CharlieCloud

  * https://en.wikipedia.org/wiki/HAProxy
    * HA load balancing with Docker images for CentOS
==== Scheduler Runs ====
  
Next, add to the script the scheduler syntax that is needed: request a GPU resource ''gpu4'' and memory, then submit jobs. An example is located at ''/home/hmeij/jobs/docker/run.docker''. Notice that we use ''nvidia-docker'' [[https://thenewstack.io/primer-nvidia-docker-containers-meet-gpus/|External Link]].  Nvidia-Docker is basically a wrapper around the docker CLI that transparently provisions a container with the necessary dependencies to execute code on the GPU.
  
<code>
(snip)
#BSUB -R "rusage[gpu4=1:mem=6288],span[hosts=1]"
  
# should add a check we get an integer back in the 0-3 range
gpuid=$(gpu-free | sed "s/,/\n/g" | shuf | head -1)
echo ""; echo "running on gpu $HOSTNAME:$gpuid"; echo ""
(snip)
--model=resnet50 \
--variable_update=parameter_server
# or run_tests.py
  
</code>
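The comment in the script above suggests checking that we get an integer in the 0-3 range back. A sketch of such a guard, split into a standalone function since ''gpu-free'' is a site-local tool:

```shell
#!/bin/bash
# sketch of the integer check suggested in the script's comment:
# accept exactly one character in the range 0-3, reject anything else
validate_gpuid() {
    case "$1" in
        [0-3]) return 0 ;;
        *)     return 1 ;;
    esac
}

# intended use inside run.docker (gpu-free is the site-local tool):
#   gpuid=$(gpu-free | sed "s/,/\n/g" | shuf | head -1)
#   validate_gpuid "$gpuid" || { echo "no free gpu on $HOSTNAME" >&2; exit 1; }
```

Aborting early when ''gpu-free'' returns nothing (all GPUs busy) is cheaper than letting the container start and crash on the GPU.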

To make the ''imports'' work, edit that python file:

<code>

import sys
sys.path.insert(0, '/mnt/hmeij/jobs/docker/benchmarks-master/scripts/tf_cnn_benchmarks/')
  
</code>
==== GPU Runs ====
  
We put the tensorflow benchmark example in a script. It will find a free GPU, set it in the environment variable ''NV_GPU'', and then run the tensorflow application on the allocated GPU.
  
<code>
(snip)
Container image Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017-2019 The TensorFlow Authors.  All rights reserved.
(snip output...)
  
# details
(snip)
Initializing graph
Running warm up
(snip output...)
  
# query what is running on gpus ... D8 is gpu 3 (ssh n79 nvidia-smi to verify)
(snip)
</code>
  
==== Pull Images ====

Pull more images from the Nvidia GPU Cloud catalog.  There are also models.  As you can tell, not all container applications are up to date.  Images were only pulled on node ''n79'', as we are not expecting any usage yet.  It is nice to pull esoteric software like the deep learning stack (digits, tensorflow, pytorch, caffe, rapidsai).
  
<code>
(snip)
DATE=$( date +%N )  # nanoseconds unique id
  
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES="0,1,2,3" \
  --name digits-$DATE-0 -d -p 5000:5000 \
  -v /data/datasets:/opt/datasets --restart=always \
  nvcr.io/nvidia/digits:19.09-tensorflow
  
  
(snip)
  
DATE=$(date +%N)  # nanoseconds as unique id
docker run --name portainer-$DATE-0 -d -p 9000:9000 \
  -v "/var/run/docker.sock:/var/run/docker.sock" --restart=always \
  portainer/portainer
  
</code>
  * NVIDIA-Docker2 2.2.2-1
  
Video was "onboard", which means Gnome/X jumps on the first GPU scanned on the PCI bus (not always the same one!). We set the ''multi-user.target'' configuration so we have video on VGA rather than on a GPU. The BIOS onboard-video-enabled setting was not changed, for now.
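Switching the default target is the standard systemd way of keeping Gnome/X from starting; a sketch of that configuration step (run as root on the node):

```shell
# boot to the text console instead of Gnome/X so no GPU gets grabbed for video
systemctl set-default multi-user.target
# apply it now without a reboot
systemctl isolate multi-user.target
```

The graphical login can always be started by hand later with ''systemctl isolate graphical.target''.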
  
<code>
(snip)
</code>
  
  
==== Setup ====
  
For a more detailed read on how to install Docker, consult [[cluster:187|NGC Docker Containers]]. Exxact delivered these nodes preconfigured with their Deep Learning Software Stack. This page details what those nodes look like.