
EXX96

A page for me on how these 12 nodes were built up after they arrived, to make them “ala n37”, which was the test node in redoing our K20 nodes; see K20 Redo and K20 Redo Usage.

Page best followed bottom to top if interested in the whole process.

The Usage section below is for HPCC users wanting to use queue exx96.

Debug for node n89 which turns itself off…grrhhh. Create a bootable USB stick with https://rufus.ie/ then unzip the BIOS and firmware zip files located in n89:/usr/local/src

[root@n89 ~]# ipmitool sel elist
   1 | 02/29/2020 | 16:57:33 | Memory #0xd1 | Uncorrectable ECC | Asserted
   2 | 03/02/2020 | 03:02:42 | Processor CPU_CATERR | IERR | Asserted
   3 | 03/11/2020 | 19:27:35 | Processor CPU_CATERR | IERR | Asserted
...[snip]...

[root@n89 ~]# ipmitool sdr elist
CPU1 Temperature | 31h | ok  |  3.0 | 43 degrees C
CPU2 Temperature | 32h | ok  |  0.0 | 40 degrees C
PSU1 Over Temp   | 92h | ok  |  0.0 | Transition to OK
PSU2 Over Temp   | 9Ah | ok  |  0.0 | Transition to OK
...[snip]...
DIMMM1_Temp      | E4h | ok  |  3.0 | 28 degrees C
CPU1_ECC1        | D1h | ok  |  0.0 | Presence Detected
CPU2_ECC1        | D3h | ok  |  0.0 | Presence Detected
...[snip]...
PMBPower1        | E1h | ok  |  3.0 | 88 Watts
PMBPower2        | E2h | ok  |  3.0 | 112 Watts
...[snip]...
FRNT_FAN1        | A2h | ok  |  0.0 | 3100 RPM
...[snip]...
PSU1 Slow FAN1   | 95h | ok  |  0.0 | Transition to OK
PSU2 Slow FAN1   | 9Dh | ok  |  0.0 | Transition to OK
...[snip]...
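
When the SEL grows long it can help to tally entries by sensor and event before reading them one by one. A minimal sketch, assuming the pipe-separated layout of `ipmitool sel elist` shown above; the function name `sel_tally` is made up for illustration.

```shell
#!/bin/bash
# Tally System Event Log entries by sensor and event.
# Input: lines in the pipe-separated layout shown above, read from stdin.
# Lines without six fields (for example "...[snip]...") are skipped.
sel_tally() {
  awk -F'|' '
    NF >= 6 {
      sensor = $4; event = $5
      gsub(/^ +| +$/, "", sensor)   # trim padding around each field
      gsub(/^ +| +$/, "", event)
      count[sensor " / " event]++
    }
    END { for (k in count) printf "%d %s\n", count[k], k }
  ' | sort -k1,1nr -k2
}

# Example on the three entries listed above.
sel_tally <<'EOF'
   1 | 02/29/2020 | 16:57:33 | Memory #0xd1 | Uncorrectable ECC | Asserted
   2 | 03/02/2020 | 03:02:42 | Processor CPU_CATERR | IERR | Asserted
   3 | 03/11/2020 | 19:27:35 | Processor CPU_CATERR | IERR | Asserted
EOF
```

On a node this would be run as `ipmitool sel elist | sel_tally`.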


[root@n89 ~]# dmidecode -t0
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.2 present.

Handle 0x0000, DMI type 0, 26 bytes
BIOS Information
        Vendor: American Megatrends Inc.
        Version: 5102
        Release Date: 02/11/2019
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 32 MB
        Characteristics:
...[snip]...
                UEFI is supported
        BIOS Revision: 5.14


[root@n89 ~]# edac-util -s -v
edac-util: EDAC drivers are loaded. 4 MCs detected:
  mc0:Skylake Socket#0 IMC#0
  mc1:Skylake Socket#0 IMC#1
  mc2:Skylake Socket#1 IMC#0
  mc3:Skylake Socket#1 IMC#1
[root@n89 ~]# edac-util
edac-util: No errors to report.

syslog

Usage

The new queue exx96 comprises nodes n79-n90. Each node holds 4x RTX2080S gpus, 2x Xeon Silver 4214 2.2 GHz 12-core cpus, 96 GB memory and a 1TB SSD. /localscratch is around 800 GB.

A new static resource is introduced for all nodes holding gpus: n78 in queue amber128, n33-n37 in queue mwgpu, and the nodes mentioned above. The name of this resource is gpu4. Moving forward please use it instead of gpu or gputest.

The wrappers provided assume your cpu:gpu ratio is 1:1, hence your submit script will have #BSUB -n 1 and gpu4=1 in its resource allocation line. If your ratio is something else, set CPU_GPU_REQUEST. For example, CPU_GPU_REQUEST=4:2 expects the lines #BSUB -n 4 and gpu4=2 in your submit script. A sample script is at /home/hmeij/k20redo/run.rtx
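
To see which submit-script lines go with a given CPU_GPU_REQUEST value, a small helper can print them. This is only a sketch: the function name `bsub_lines_for` is hypothetical, and the `-R "rusage[...]"` form of the resource allocation line is an assumption, so compare against the sample script before copying anything.

```shell
#!/bin/bash
# Print the #BSUB lines matching a CPU_GPU_REQUEST value of the form cpus:gpus.
# Assumption: the resource allocation line uses -R "rusage[gpu4=N]";
# check the sample script for the form actually used on this cluster.
bsub_lines_for() {
  local request=${1:-"1:1"}
  case "$request" in
    *[!0-9:]* | *:*:* | :* | *: ) echo "bad CPU_GPU_REQUEST: $request" >&2; return 1 ;;
    *:*) ;;
    *) echo "bad CPU_GPU_REQUEST: $request" >&2; return 1 ;;
  esac
  local cpus=${request%%:*} gpus=${request##*:}
  printf '#BSUB -n %s\n' "$cpus"
  printf '#BSUB -R "rusage[gpu4=%s]"\n' "$gpus"
}

bsub_lines_for 4:2
```

Called without an argument it falls back to the 1:1 default that the wrappers assume.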

The wrappers (n78.mpich3.wrapper for n78, and n37.openmpi.wrapper for all others) are located in /usr/local/bin and will set up your environment and start one of these applications: amber, lammps, gromacs, matlab or namd from /usr/local.

# command that shows gpu reservations
bhosts -l n79
             gputest gpu4
 Total             0    3
 Reserved        0.0  1.0

# old way of doing that
lsload -l n79

HOST_NAME               status  r15s   r1m  r15m   ut    pg    io  ls    it   tmp   swp   mem    gpu
n79                         ok   0.0   0.0   0.0   0%   0.0     0   0 2e+08  826G   10G   90G    3.0
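
For scripting, the gpu4 column can be pulled out of the `bhosts -l` output. A minimal sketch, assuming the column layout shown above (a header line naming the resources, then Total and Reserved rows); the function name `gpu4_status` is made up, and the values are reported exactly as the scheduler prints them.

```shell
#!/bin/bash
# Extract the gpu4 values from "bhosts -l <node>" output read on stdin.
# The header row has no leading label, so the column is located by its
# offset from the end of the line.
gpu4_status() {
  awk '
    !found && /gpu4/ {
      for (i = 1; i <= NF; i++) if ($i == "gpu4") { off = NF - i; found = 1 }
      next
    }
    found && $1 == "Total"    { total = $(NF - off) }
    found && $1 == "Reserved" { reserved = $(NF - off) }
    END { if (found) printf "total=%s reserved=%s\n", total, reserved }
  '
}

# Example on the listing above.
gpu4_status <<'EOF'
             gputest gpu4
 Total             0    3
 Reserved        0.0  1.0
EOF
```

On the cluster this would be `bhosts -l n79 | gpu4_status`.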

Peer to peer communication is possible (via PCIe rather than NVLink) with this hardware, but it gets rather messy to set up. Some quick, off-the-cuff performance data shows some impact, but generally in our environment the gains are not worth the effort. The runs below used Amber with pmemd.cuda.MPI.

                                                                              cpu:gpu
mdout.325288:|  Master Total CPU time:          982.60 seconds     0.27 hours   1:1
mdout.325289:|  Master Total CPU time:          611.08 seconds     0.17 hours   4:2
mdout.326208:|  Master Total CPU time:          537.97 seconds     0.15 hours  36:4
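
The three timings can be turned into speedups relative to the 1:1 run. A small sketch using only the numbers in the table; note it compares the reported Master Total CPU time, which is a rough stand-in for turnaround time. The function name `speedup_table` is made up.

```shell
#!/bin/bash
# Speedup of each run relative to the first row (the 1:1 run).
# Input: "cpu:gpu seconds" pairs, one per line, on stdin.
speedup_table() {
  awk 'NR == 1 { base = $2 }
       { printf "%-5s %8.2f s  %.2fx\n", $1, $2, base / $2 }'
}

speedup_table <<'EOF'
1:1  982.60
4:2  611.08
36:4 537.97
EOF
```

This gives roughly 1.6x for 4:2 and 1.8x for 36:4, so doubling the gpus again from 2 to 4 buys comparatively little.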

Miscellaneous

Install scheduler RPM for CentOS7, reconfigure (hosts, queue, static resource), elim. Test it out with old wrapper.

Edit the n37.openmpi.wrapper for n33-n37 and n79-n90, and the one on n78, for the new static resource gpu4.

Add nodes to ZenOSS hpcmon.

Propagate global known_hosts files in users' ~/.ssh/ dirs.

Look at how accounting ties in with resource request gpu4= versus gpu= …

# propagate global passwd, shadow, group, hosts file

# add to date_ctt2.sh script, get and set date

NOW=`/bin/date +%m%d%H%M%Y.%S`
for i in `seq 79 90`; do echo n$i; ssh n$i date $NOW; done

# crontab

# ionice gaussian
0,15,30,45 * * * * /share/apps/scripts/ionice_lexes.sh  > /dev/null 2>&1

# cpu temps
40 * * * * /share/apps/scripts/lm_sensors.sh > /dev/null 2>&1
 
# rc.local, chmod o+x /etc/rc.d/rc.local, then add

# for mapd, 'All On' enable graphics rendering support
#/usr/bin/nvidia-smi --gom=0

# for amber16 -pm=1/ENABLED -c=1/EXCLUSIVE_PROCESS
#nvidia-smi --persistence-mode=1
#nvidia-smi --compute-mode=1

# for mwgpu/exx96 -pm=1/ENABLED -c=0/DEFAULT
# note: turned this off, running with defaults
# seems stable, maybe persistence later on
# lets see how docker interacts first...
#nvidia-smi --persistence-mode=1
#nvidia-smi --compute-mode=0

# turn ECC off (memory scrubbing)
#/usr/bin/nvidia-smi -e 0

# lm_sensor
modprobe coretemp
modprobe tmp401
#modprobe w83627ehf
 
reboot
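
The "propagate global passwd, shadow, group, hosts file" step above has no command listed. One way to do it, in the same loop style as the date step, is sketched here. The function names and the DRY_RUN switch are made up for illustration; by default the scp commands are only printed, not run.

```shell
#!/bin/bash
# Print node names for a numeric range, e.g. node_range 79 90 -> n79 ... n90.
node_range() {
  local i
  for i in $(seq "$1" "$2"); do echo "n$i"; done
}

# Copy the listed files to the same path on every node in the range.
# With DRY_RUN=1 (the default here) the scp commands are only printed.
push_files() {
  local first=$1 last=$2 node f
  shift 2
  for node in $(node_range "$first" "$last"); do
    for f in "$@"; do
      if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "scp -p $f $node:$f"
      else
        scp -p "$f" "$node:$f"
      fi
    done
  done
}

push_files 79 90 /etc/passwd /etc/shadow /etc/group /etc/hosts
```

Set DRY_RUN=0 once the printed commands look right.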

Recipe

Steps. “Ala n37” … so the RTX nodes are similar to the K20 nodes and we can put the local software in place. See K20 Redo page. First we add these packages and clean up.

# hook up DVI-D cable to GPU port (offboard video)
# login as root check some things out...
free -g
nvidia-smi
docker images
docker ps
# set local time zone
mv /etc/localtime /etc/localtime.backup
ln -s /usr/share/zoneinfo/America/New_York /etc/localtime
# change passwords for root and vendor account
passwd
passwd exx
# set hostname
hostnamectl set-hostname n79
# configure private subnets and ping file server
cd /etc/sysconfig/network-scripts/
vi ifcfg-enp1s0f0
vi ifcfg-enp1s0f1
systemctl restart network
ping -c 3 192.168.102.42
ping -c 3 10.10.102.42
# make internet connection for yum
ifdown enp1s0f0
vi ifcfg-enp1s0f0
systemctl restart network
dig google.com
#centos7
yum install -y iptables-services
vi /etc/sysconfig/iptables
systemctl start iptables
iptables -L
systemctl stop firewalld
systemctl disable firewalld
# other configs
vi /etc/selinux/config   # set SELINUX=disabled
mv /home /usr/local/
mkdir /home
vi /etc/passwd   # update $HOME for exx, dockeruser
mkdir /sanscratch /localscratch
chmod ugo+rwx /sanscratch /localscratch
chmod o+t /sanscratch /localscratch
ln -s /home /share
ssh-keygen -t rsa
scp 10.10.102.253:/root/.ssh/authorized_keys /root/.ssh/
vi /etc/ssh/sshd_config   # PermitRootLogin
echo "relayhost = 192.168.102.42" >> /etc/postfix/main.cf
# add packages and update
yum install epel-release -y
yum install flex flex-devel bison bison-devel -y 
yum install tcl tcl-devel dmtcp -y
yum install net-snmp net-snmp-libs net-snmp-agent-libs net-tools net-snmp-utils -y
yum install freeglut-devel libXi-devel libXmu-devel make mesa-libGLU-devel -y
yum install blas blas-devel lapack lapack-devel boost boost-devel -y
yum install tkinter lm_sensors lm_sensors-libs -y
yum install zlib-devel bzip2 bzip2-devel -y
yum install openmpi openmpi-devel perl-ExtUtils-MakeMaker -y
yum install cmake cmake-devel -y
yum install libjpeg libjpeg-devel libjpeg-turbo-devel -y
# amber
yum -y install tcsh make \
               gcc gcc-gfortran gcc-c++ \
               which flex bison patch bc \
               libXt-devel libXext-devel \
               perl perl-ExtUtils-MakeMaker util-linux wget \
               bzip2 bzip2-devel zlib-devel tar 
yum update -y
yum clean all
# remove internet, bring private back up
ifdown enp1s0f0
vi ifcfg-enp1s0f0
ifup enp1s0f0
# passwd, shadow, group, hosts, fstab
mkdir /homeextra1 /homeextra2 /home33 /mindstore
cd /etc/
# backup files to -orig versions
scp 192.168.102.89:/etc/passwd /etc/passwd   # and others
scp 10.10.102.89:/etc/fstab /tmp
vi /etc/fstab
mount -a; df -h
# pick the kernel vendor used for now
grep ^menuentry /etc/grub2.cfg
grub2-set-default 1
ls -d /sys/firmware/efi && echo "EFI" || echo "Legacy"
grub2-mkconfig -o /boot/grub2/grub.cfg
#grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
# old level 3
systemctl set-default multi-user.target
reboot
# switch to VGA
cd /usr/local/src/
tar zxf n37.chroot-keep.ul.tar.gz
cd usr/local/
mv amber16/  fsl-5.0.10/ gromacs-2018/ lammps-22Aug18/ /usr/local/
mv cuda-9.2/ /usr/local/n37-cuda-9.2/
cd /usr/local/bin/
rsync -vac 10.10.102.89:/usr/local/bin/ /usr/local/bin/
# test scripts gpu-free, gpu-info, gpu-process
0,1,2,3
id,name,temp.gpu,mem.used,mem.free,util.gpu,util.mem
0, GeForce RTX 2080 SUPER, 25, 126 MiB, 7855 MiB, 0 %, 0 %
1, GeForce RTX 2080 SUPER, 24, 11 MiB, 7971 MiB, 0 %, 0 %
2, GeForce RTX 2080 SUPER, 23, 11 MiB, 7971 MiB, 0 %, 0 %
3, GeForce RTX 2080 SUPER, 23, 11 MiB, 7971 MiB, 0 %, 0 %
gpu_name, gpu_bus_id, pid, process_name
GeForce RTX 2080 SUPER, 00000000:3B:00.0, 3109, python
# done
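
Two steps in the recipe are easy to get wrong when repeating it on 12 nodes: the scratch directory permissions and the grub.cfg path for EFI versus legacy boot. A sketch of a quick check follows. The function names are made up, `stat -c` assumes GNU coreutils, and the optional root prefix in `grub_cfg_target` is hypothetical, only there so the function can be tried outside a node.

```shell
#!/bin/bash
# Return 0 if DIR has mode 1777 (world-writable plus sticky bit), which is
# what chmod ugo+rwx followed by chmod o+t typically leaves on a fresh mkdir.
is_scratch_mode() {
  [ "$(stat -c %a "$1" 2>/dev/null)" = "1777" ]
}

# Print the grub.cfg path matching the boot mode, using the same
# /sys/firmware/efi test and the two paths from the recipe.
grub_cfg_target() {
  local root=${1:-}
  if [ -d "$root/sys/firmware/efi" ]; then
    echo /boot/efi/EFI/centos/grub.cfg
  else
    echo /boot/grub2/grub.cfg
  fi
}

for d in /sanscratch /localscratch; do
  if is_scratch_mode "$d"; then
    echo "$d ok"
  else
    echo "$d: missing or not mode 1777"
  fi
done
echo "grub2-mkconfig target: $(grub_cfg_target)"
```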

What We Purchased

12 nodes yielding a total of
- 24 cpus
- 288 cpu cores
- 1,152 gb cpu mem
- ~20 Tflops (dpfp)
- 48 gpus
- 384 gb gpu mem
- ~700 Tflops (mixed mode)
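
The counts in this list follow from the per-node figures in the Usage section. A quick check, where the 8 GB per gpu is an assumption inferred from the 384 total over 48 gpus:

```shell
#!/bin/bash
# Per-node figures from the Usage section.
# gpu_mem_gb=8 is inferred from the 384 total over 48 gpus.
nodes=12
cpus_per_node=2
cores_per_cpu=12
mem_gb_per_node=96
gpus_per_node=4
gpu_mem_gb=8

echo "cpus:       $((nodes * cpus_per_node))"
echo "cpu cores:  $((nodes * cpus_per_node * cores_per_cpu))"
echo "cpu mem gb: $((nodes * mem_gb_per_node))"
echo "gpus:       $((nodes * gpus_per_node))"
echo "gpu mem gb: $((nodes * gpus_per_node * gpu_mem_gb))"
```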

# docker images

REPOSITORY                         TAG                            IMAGE ID            CREATED             SIZE
nvcr.io/nvidia/cuda                10.1-devel                     9e47e9dfcb9a        2 months ago        2.83GB
portainer/portainer                latest                         ff4ee4caaa23        2 months ago        81.6MB
nvidia/cuda                        9.2-devel                      1874839f75d5        2 months ago        2.35GB
nvcr.io/nvidia/cuda                9.2-devel                      1874839f75d5        2 months ago        2.35GB
nvcr.io/nvidia/cuda                10.0-devel                     f765411c4ae6        2 months ago        2.29GB
nvcr.io/nvidia/digits              19.09-tensorflow               b08982c9545c        4 months ago        8.85GB
nvcr.io/nvidia/tensorflow          19.09-py2                      b82bcb185286        4 months ago        7.88GB
nvcr.io/nvidia/pytorch             19.09-py3                      9d6f9ccfbe31        5 months ago        9.15GB
nvcr.io/nvidia/caffe               19.09-py2                      b52fbbef7e6b        5 months ago        5.15GB
nvcr.io/nvidia/rapidsai/rapidsai   0.9-cuda10.0-runtime-centos7   22b5dc2f7e84        5 months ago        5.84GB

free -m
              total        used        free      shared  buff/cache   available
Mem:          95056        1919       85338          20        7798       92571
Swap:         10239           0       10239


# nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.31       Driver Version: 440.31       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 24%   24C    P8     8W / 250W |    275MiB /  7981MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:5E:00.0 Off |                  N/A |
| 25%   24C    P8    10W / 250W |     12MiB /  7982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:AF:00.0 Off |                  N/A |
| 24%   23C    P8     4W / 250W |     12MiB /  7982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:D8:00.0 Off |                  N/A |
| 25%   22C    P8    13W / 250W |     12MiB /  7982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3127      C   python                                       115MiB |
|    0      5715      G   /usr/bin/X                                    84MiB |
|    0      6307      G   /usr/bin/gnome-shell                          70MiB |
+-----------------------------------------------------------------------------+

# df -h

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         47G     0   47G   0% /dev
tmpfs            47G     0   47G   0% /dev/shm
tmpfs            47G   13M   47G   1% /run
tmpfs            47G     0   47G   0% /sys/fs/cgroup
/dev/nvme0n1p3  929G   42G  840G   5% /
/dev/nvme0n1p1  477M  199M  249M  45% /boot
overlay         929G   42G  840G   5% /var/lib/docker/overlay2/6f7af00a8eb8b5ede68fd6bc9be5f7220525bdde21c14e6f1643a2a7debc454b/merged
overlay         929G   42G  840G   5% /var/lib/docker/overlay2/9cf895d8a17106f16ba997e6025f5912abb988a512779caa2c35e2da3e7d196a/merged
tmpfs           9.3G   28K  9.3G   1% /run/user/0

# yum repolist

repo id                             repo name                             status
base/7/x86_64                       CentOS-7 - Base                       10,097
docker-ce-stable/x86_64             Docker CE Stable - x86_64                 63
extras/7/x86_64                     CentOS-7 - Extras                        323
libnvidia-container                 libnvidia-container                       65
nvidia-container-runtime            nvidia-container-runtime                  54
nvidia-docker                       nvidia-docker                             50
updates/7/x86_64                    CentOS-7 - Updates                     1,117