EXX96

A page for me on how these 12 nodes were built up after they arrived. The goal is to make them “ala n37”, which was the test node in redoing our K20 nodes; see K20 Redo.

WhatWeDo?

Steps. “Ala n37” means the RTX nodes are set up like the K20 nodes, so we can put the local software in place; see the K20 Redo page. First we add these packages and clean up.

yum install epel-release
yum install tcl tcl-devel dmtcp
yum install freeglut-devel libXi-devel libXmu-devel make mesa-libGLU-devel
yum install blas blas-devel lapack lapack-devel boost boost-devel
yum install tkinter lm_sensors lm_sensors-libs
yum install zlib-devel bzip2-devel bzip bzip-devel
yum install openmpi openmpi-devel perl-ExtUtils-MakeMaker
yum install cmake cmake-devel
yum install libjpeg libjpeg-devel libjpeg-turbo-devel
yum clean all
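
The installs above can also be replayed unattended on each node. The following is a minimal sketch, not necessarily how these nodes were built: the function names `install_group` and `install_node_packages` and the `YUM` override variable are hypothetical, and `-y` (assume yes) is added so yum does not prompt. Package names are copied from the steps above unchanged.

```shell
# Sketch only: replays the package installs listed above without prompting.
# YUM is a hypothetical override so the commands can be previewed first,
# for example YUM="echo yum".
YUM="${YUM:-yum}"

install_group() {
    # $YUM is left unquoted on purpose so "echo yum" splits into two words
    $YUM -y install "$@" || echo "WARN: yum install failed for: $*" >&2
}

install_node_packages() {
    # Same groups, same order as the steps above
    install_group epel-release
    install_group tcl tcl-devel dmtcp
    install_group freeglut-devel libXi-devel libXmu-devel make mesa-libGLU-devel
    install_group blas blas-devel lapack lapack-devel boost boost-devel
    install_group tkinter lm_sensors lm_sensors-libs
    install_group zlib-devel bzip2-devel bzip bzip-devel
    install_group openmpi openmpi-devel perl-ExtUtils-MakeMaker
    install_group cmake cmake-devel
    install_group libjpeg libjpeg-devel libjpeg-turbo-devel
    $YUM clean all
}
```

Setting `YUM="echo yum"` before calling `install_node_packages` prints the commands instead of running them, which is a cheap way to review the list before touching a node.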

WhatWeGot?

  • 12 nodes yielding a total of
    • 24 cpus
    • 288 cpu cores
    • 1,152 gb cpu mem
    • ~20 Tflops (dpfp)
    • 48 gpus
    • 384 gb gpu mem
    • ~700 Tflops (mixed mode)
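
The counts in this list follow from the per-node configuration. As a quick check, the arithmetic can be sketched in shell. The per-node figures used here (2 cpus of 12 cores, 96 gb memory, 4 gpus of 8 gb each) are assumptions obtained by dividing the totals by 12 and rounding the per-gpu memory reported by nvidia-smi; the variable names are only illustrative. The Tflops figures are not recomputed.

```shell
# Per-node figures inferred from the totals in the list (assumptions, not vendor specs)
NODES=12
CPUS_PER_NODE=2
CORES_PER_CPU=12
MEM_GB_PER_NODE=96
GPUS_PER_NODE=4
GPU_MEM_GB=8

# Cluster-wide totals
TOTAL_CPUS=$((NODES * CPUS_PER_NODE))
TOTAL_CORES=$((TOTAL_CPUS * CORES_PER_CPU))
TOTAL_MEM_GB=$((NODES * MEM_GB_PER_NODE))
TOTAL_GPUS=$((NODES * GPUS_PER_NODE))
TOTAL_GPU_MEM_GB=$((TOTAL_GPUS * GPU_MEM_GB))

echo "cpus=$TOTAL_CPUS cores=$TOTAL_CORES mem_gb=$TOTAL_MEM_GB gpus=$TOTAL_GPUS gpu_mem_gb=$TOTAL_GPU_MEM_GB"
# prints: cpus=24 cores=288 mem_gb=1152 gpus=48 gpu_mem_gb=384
```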
# docker images

REPOSITORY                         TAG                            IMAGE ID            CREATED             SIZE
nvcr.io/nvidia/cuda                10.1-devel                     9e47e9dfcb9a        2 months ago        2.83GB
portainer/portainer                latest                         ff4ee4caaa23        2 months ago        81.6MB
nvidia/cuda                        9.2-devel                      1874839f75d5        2 months ago        2.35GB
nvcr.io/nvidia/cuda                9.2-devel                      1874839f75d5        2 months ago        2.35GB
nvcr.io/nvidia/cuda                10.0-devel                     f765411c4ae6        2 months ago        2.29GB
nvcr.io/nvidia/digits              19.09-tensorflow               b08982c9545c        4 months ago        8.85GB
nvcr.io/nvidia/tensorflow          19.09-py2                      b82bcb185286        4 months ago        7.88GB
nvcr.io/nvidia/pytorch             19.09-py3                      9d6f9ccfbe31        5 months ago        9.15GB
nvcr.io/nvidia/caffe               19.09-py2                      b52fbbef7e6b        5 months ago        5.15GB
nvcr.io/nvidia/rapidsai/rapidsai   0.9-cuda10.0-runtime-centos7   22b5dc2f7e84        5 months ago        5.84GB

# free -g
              total        used        free      shared  buff/cache   available
Mem:             92           2          88           0           1          89

# nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.31       Driver Version: 440.31       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 24%   24C    P8     8W / 250W |    275MiB /  7981MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:5E:00.0 Off |                  N/A |
| 25%   24C    P8    10W / 250W |     12MiB /  7982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:AF:00.0 Off |                  N/A |
| 24%   23C    P8     4W / 250W |     12MiB /  7982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:D8:00.0 Off |                  N/A |
| 25%   22C    P8    13W / 250W |     12MiB /  7982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3127      C   python                                       115MiB |
|    0      5715      G   /usr/bin/X                                    84MiB |
|    0      6307      G   /usr/bin/gnome-shell                          70MiB |
+-----------------------------------------------------------------------------+
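
For scripted checks, the per-gpu memory shown in the table can be pulled in machine-readable form. The sketch below assumes the `--query-gpu` and `--format=csv,noheader,nounits` options of `nvidia-smi` are available with this driver; `sum_gpu_mem_mib` is a hypothetical helper that adds up one number per line read from standard input.

```shell
# Hypothetical helper: sum one MiB value per input line
sum_gpu_mem_mib() {
    awk '{ total += $1 } END { print total + 0 }'
}

# On a node (requires the NVIDIA driver):
# nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | sum_gpu_mem_mib

# With the four values from the table above:
printf '%s\n' 7981 7982 7982 7982 | sum_gpu_mem_mib
# prints: 31927
```

That is roughly 8 gb per gpu, or about 32 gb of gpu memory per node.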

# df -h

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         47G     0   47G   0% /dev
tmpfs            47G     0   47G   0% /dev/shm
tmpfs            47G   13M   47G   1% /run
tmpfs            47G     0   47G   0% /sys/fs/cgroup
/dev/nvme0n1p3  929G   42G  840G   5% /
/dev/nvme0n1p1  477M  199M  249M  45% /boot
overlay         929G   42G  840G   5% /var/lib/docker/overlay2/6f7af00a8eb8b5ede68fd6bc9be5f7220525bdde21c14e6f1643a2a7debc454b/merged
overlay         929G   42G  840G   5% /var/lib/docker/overlay2/9cf895d8a17106f16ba997e6025f5912abb988a512779caa2c35e2da3e7d196a/merged
tmpfs           9.3G   28K  9.3G   1% /run/user/0

# yum repolist

repo id                             repo name                             status
base/7/x86_64                       CentOS-7 - Base                       10,097
docker-ce-stable/x86_64             Docker CE Stable - x86_64                 63
extras/7/x86_64                     CentOS-7 - Extras                        323
libnvidia-container                 libnvidia-container                       65
nvidia-container-runtime            nvidia-container-runtime                  54
nvidia-docker                       nvidia-docker                             50
updates/7/x86_64                    CentOS-7 - Updates                     1,117

Pictures

Yea, found 1T SSD
HDMI ports on gpu
GPU detail, blower model
Back, gpus stacked 2 on 2
Front, all drive bays empty
Racking
Boxes



cluster/192.1581968815.txt.gz · Last modified: 2020/02/17 19:46 by hmeij07