\\
**[[cluster:0|Back]]**
===== EXX96 =====
A page for me on how these 12 nodes were built up after they arrived, to make them "ala n37", the test node used in redoing our K20 nodes. See [[cluster:172|K20 Redo]] and [[cluster:173|K20 Redo Usage]].
This page is best followed bottom to top if you are interested in the whole process.
The Usage section below is for HPCC users wanting to use queue ''exx96''.
Debug notes for node n89, which turns itself off... grrhhh. Create a bootable USB stick with https://rufus.ie/ then unzip the BIOS and firmware zip files located in ''n89:/usr/local/src''
[root@n89 ~]# ipmitool sel elist
1 | 02/29/2020 | 16:57:33 | Memory #0xd1 | Uncorrectable ECC | Asserted
2 | 03/02/2020 | 03:02:42 | Processor CPU_CATERR | IERR | Asserted
3 | 03/11/2020 | 19:27:35 | Processor CPU_CATERR | IERR | Asserted
...[snip]...
[root@n89 ~]# ipmitool sdr elist
CPU1 Temperature | 31h | ok | 3.0 | 43 degrees C
CPU2 Temperature | 32h | ok | 0.0 | 40 degrees C
PSU1 Over Temp | 92h | ok | 0.0 | Transition to OK
PSU2 Over Temp | 9Ah | ok | 0.0 | Transition to OK
...[snip]...
DIMMM1_Temp | E4h | ok | 3.0 | 28 degrees C
CPU1_ECC1 | D1h | ok | 0.0 | Presence Detected
CPU2_ECC1 | D3h | ok | 0.0 | Presence Detected
...[snip]...
PMBPower1 | E1h | ok | 3.0 | 88 Watts
PMBPower2 | E2h | ok | 3.0 | 112 Watts
...[snip]...
FRNT_FAN1 | A2h | ok | 0.0 | 3100 RPM
...[snip]...
PSU1 Slow FAN1 | 95h | ok | 0.0 | Transition to OK
PSU2 Slow FAN1 | 9Dh | ok | 0.0 | Transition to OK
...[snip]...
[root@n89 ~]# dmidecode -t0
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.2 present.
Handle 0x0000, DMI type 0, 26 bytes
BIOS Information
Vendor: American Megatrends Inc.
Version: 5102
Release Date: 02/11/2019
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 32 MB
Characteristics:
...[snip]...
UEFI is supported
BIOS Revision: 5.14
[root@n89 ~]# edac-util -s -v
edac-util: EDAC drivers are loaded. 4 MCs detected:
mc0:Skylake Socket#0 IMC#0
mc1:Skylake Socket#0 IMC#1
mc2:Skylake Socket#1 IMC#0
mc3:Skylake Socket#1 IMC#1
[root@n89 ~]# edac-util
edac-util: No errors to report.
Also check syslog for MCE/EDAC messages.
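To spot recurring events without reading the SEL by hand, a simple grep can be run (or cron'd) over the ''ipmitool sel elist'' output; a minimal sketch, demoed here against the sample SEL entries above:

```shell
# count uncorrectable ECC and CPU IERR events in a SEL dump;
# sample data copied from the ipmitool sel elist output above
sel='1 | 02/29/2020 | 16:57:33 | Memory #0xd1 | Uncorrectable ECC | Asserted
2 | 03/02/2020 | 03:02:42 | Processor CPU_CATERR | IERR | Asserted
3 | 03/11/2020 | 19:27:35 | Processor CPU_CATERR | IERR | Asserted'
echo "$sel" | grep -c 'Uncorrectable ECC'   # -> 1
echo "$sel" | grep -c 'IERR'                # -> 2
```

On a live node, pipe ''ipmitool sel elist'' straight into the greps instead of the sample variable.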
==== Usage ====
The new queue ''exx96'' will be comprised of nodes ''n79-n90''. Each node holds 4x RTX2080S gpus, 2x Xeon Silver 4214 2.2 GHz 12-core cpus, 96 GB of memory and a 1 TB SSD. ''/localscratch'' is around 800 GB.
A new static resource is introduced for all nodes holding gpus: ''n78'' in queue ''amber128'', ''n33-n37'' in queue ''mwgpu'', and the nodes mentioned above. The name of this resource is ''gpu4''. Moving forward please use it instead of ''gpu'' or ''gputest''.
The wrappers provided assume your cpu:gpu ratio is 1:1, hence in your submit code you will have ''#BSUB -n 1'' and in your resource allocation line ''gpu4=1''. If your ratio is something else you can set ''CPU_GPU_REQUEST''. For example, ''CPU_GPU_REQUEST=4:2'' expects the lines ''#BSUB -n 4'' and ''gpu4=2'' in your submit script. A sample script is at ''/home/hmeij/k20redo/run.rtx''
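A minimal sketch of how the two submit-script lines follow from ''CPU_GPU_REQUEST''; the ''rusage[]'' form shown is an assumption about how the ''gpu4'' resource is requested, so check the sample script for the exact syntax:

```shell
# derive the #BSUB lines implied by a cpu:gpu ratio (hypothetical helper logic)
CPU_GPU_REQUEST=4:2
cpus=${CPU_GPU_REQUEST%%:*}   # part before the colon
gpus=${CPU_GPU_REQUEST##*:}   # part after the colon
echo "#BSUB -n $cpus"
echo "#BSUB -R \"rusage[gpu4=$gpus]\""
```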
The wrappers (''n78.mpich3.wrapper'' for ''n78'', and ''n37.openmpi.wrapper'' for all others) are located in ''/usr/local/bin'' and will set up your environment and start any of these applications from ''/usr/local'': amber, lammps, gromacs, matlab and namd.
# command that shows gpu reservations
bhosts -l n79
gputest gpu4
Total 0 3
Reserved 0.0 1.0
# old way of doing that
lsload -l n79
HOST_NAME status r15s r1m r15m ut pg io ls it tmp swp mem gpu
n79 ok 0.0 0.0 0.0 0% 0.0 0 0 2e+08 826G 10G 90G 3.0
Peer-to-peer communication is possible (via PCIe rather than NVLink) with this hardware, but it gets rather messy to set up. Some quick off-the-cuff performance data reveals some impact; generally, in our environment, the gains are not worth the effort. Using Amber and ''pmemd.cuda.MPI'':
cpu:gpu
mdout.325288:| Master Total CPU time: 982.60 seconds 0.27 hours 1:1
mdout.325289:| Master Total CPU time: 611.08 seconds 0.17 hours 4:2
mdout.326208:| Master Total CPU time: 537.97 seconds 0.15 hours 36:4
==== Miscellaneous ====
Install the scheduler RPM for CentOS 7; reconfigure (hosts, queue, static resource, elim). Test it out with the old wrapper.
Edit the n37.openmpi.wrapper for n33-n37 and n79-n90, and the one on n78, for the new static resource ''gpu4''.
Add nodes to ZenOSS hpcmon.
Propagate the global ''known_hosts'' file into users' ''~/.ssh/'' dirs.
Look at how accounting ties in with resource request ''gpu4='' versus ''gpu='' ...
# propagate global passwd, shadow, group, hosts file
# add to date_ctt2.sh script, get and set date
NOW=`/bin/date +%m%d%H%M%Y.%S`
for i in `seq 79 90`; do echo n$i; ssh n$i date $NOW; done
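The file propagation noted above can be sketched as a dry run (drop the ''echo'' to actually copy; assumes root ssh keys are already in place):

```shell
# push the global auth/host files to each new node (dry run prints the commands)
for f in passwd shadow group hosts; do
  for i in $(seq 79 90); do
    echo scp /etc/$f n$i:/etc/$f
  done
done
```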
# crontab
# ionice gaussian
0,15,30,45 * * * * /share/apps/scripts/ionice_lexes.sh > /dev/null 2>&1
# cpu temps
40 * * * * /share/apps/scripts/lm_sensors.sh > /dev/null 2>&1
# rc.local, chmod o+x /etc/rc.d/rc.local, then add
# for mapd, 'All On' enable graphicsrendering support
#/usr/bin/nvidia-smi --gom=0
# for amber16 -pm=1/ENABLED -c=1/EXCLUSIVE_PROCESS
#nvidia-smi --persistence-mode=1
#nvidia-smi --compute-mode=1
# for mwgpu/exx96 -pm=1/ENABLED -c=0/DEFAULT
# note: turned this off, running with defaults
# seems stable, maybe persistence later on
# lets see how docker interacts first...
#nvidia-smi --persistence-mode=1
#nvidia-smi --compute-mode=0
# turn ECC off (memory scrubbing)
#/usr/bin/nvidia-smi -e 0
# lm_sensor
modprobe coretemp
modprobe tmp401
#modprobe w83627ehf
reboot
==== Recipe ====
Steps, "ala n37" ... the RTX nodes are made similar to the K20 nodes so we can put the local software in place. See the [[cluster:172|K20 Redo]] page. First we add these packages and clean up.
# hook up DVI-D cable to GPU port (offboard video)
# login as root check some things out...
free -g
nvidia-smi
docker images
docker ps
# set local time zone
mv /etc/localtime /etc/localtime.backup
ln -s /usr/share/zoneinfo/America/New_York /etc/localtime
# change passwords for root and vendor account
passwd
passwd exx
# set hostname
hostnamectl set-hostname n79
# configure private subnets and ping file server
cd /etc/sysconfig/network-scripts/
vi ifcfg-enp1s0f0
vi ifcfg-enp1s0f1
systemctl restart network
ping -c 3 192.168.102.42
ping -c 3 10.10.102.42
# make internet connection for yum
ifdown enp1s0f0
vi ifcfg-enp1s0f0
systemctl restart network
dig google.com
#centos7
yum install -y iptables-services
vi /etc/sysconfig/iptables
systemctl start iptables
iptables -L
systemctl stop firewalld
systemctl disable firewalld
# other configs
vi /etc/selinux/config (disabled)
mv /home /usr/local/
mkdir /home
vi /etc/passwd (exx, dockeruser $HOME)
mkdir /sanscratch /localscratch
chmod ugo+rwx /sanscratch /localscratch
chmod o+t /sanscratch /localscratch
ln -s /home /share
ssh-keygen -t rsa
scp 10.10.102.253:/root/.ssh/authorized_keys /root/.ssh/
vi /etc/ssh/sshd_config (PermitRootLogin)
echo "relayhost = 192.168.102.42" >> /etc/postfix/main.cf
# add packages and update
yum install epel-release -y
yum install flex flex-devel bison bison-devel -y
yum install tcl tcl-devel dmtcp -y
yum install net-snmp net-snmp-libs net-snmp-agent-libs net-tools net-snmp-utils -y
yum install freeglut-devel libXi-devel libXmu-devel make mesa-libGLU-devel -y
yum install blas blas-devel lapack lapack-devel boost boost-devel -y
yum install tkinter lm_sensors lm_sensors-libs -y
yum install zlib-devel bzip2 bzip2-devel -y
yum install openmpi openmpi-devel perl-ExtUtils-MakeMaker -y
yum install cmake cmake-devel -y
yum install libjpeg libjpeg-devel libjpeg-turbo-devel -y
# amber
yum -y install tcsh make \
gcc gcc-gfortran gcc-c++ \
which flex bison patch bc \
libXt-devel libXext-devel \
perl perl-ExtUtils-MakeMaker util-linux wget \
bzip2 bzip2-devel zlib-devel tar
yum update -y
yum clean all
# remove internet, bring private back up
ifdown enp1s0f0
vi ifcfg-enp1s0f0
ifup enp1s0f0
# passwd, shadow, group, hosts, fstab
mkdir /homeextra1 /homeextra2 /home33 /mindstore
cd /etc/
# backup files to -orig versions
scp 192.168.102.89:/etc/passwd /etc/passwd (and others)
scp 10.10.102.89:/etc/fstab /tmp
vi /etc/fstab
mount -a; df -h
# pick the kernel the vendor used, for now
grep ^menuentry /etc/grub2.cfg
grub2-set-default 1
ls -d /sys/firmware/efi && echo "EFI" || echo "Legacy"
grub2-mkconfig -o /boot/grub2/grub.cfg
#grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
# old level 3
systemctl set-default multi-user.target
reboot
# switch to VGA
cd /usr/local/src/
tar zxf n37.chroot-keep.ul.tar.gz
cd usr/local/
mv amber16/ fsl-5.0.10/ gromacs-2018/ lammps-22Aug18/ /usr/local/
mv cuda-9.2/ /usr/local/n37-cuda-9.2/
cd /usr/local/bin/
rsync -vac 10.10.102.89:/usr/local/bin/ /usr/local/bin/
# test scripts gpu-free, gpu-info, gpu-process
0,1,2,3
id,name,temp.gpu,mem.used,mem.free,util.gpu,util.mem
0, GeForce RTX 2080 SUPER, 25, 126 MiB, 7855 MiB, 0 %, 0 %
1, GeForce RTX 2080 SUPER, 24, 11 MiB, 7971 MiB, 0 %, 0 %
2, GeForce RTX 2080 SUPER, 23, 11 MiB, 7971 MiB, 0 %, 0 %
3, GeForce RTX 2080 SUPER, 23, 11 MiB, 7971 MiB, 0 %, 0 %
gpu_name, gpu_bus_id, pid, process_name
GeForce RTX 2080 SUPER, 00000000:3B:00.0, 3109, python
# done
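The helper scripts themselves are local tools, but their output matches nvidia-smi's CSV query mode; a plausible sketch (the exact script contents are an assumption), with ''gpu-free'' demoed as an awk filter over sample ''gpu-info'' output (one line altered to show a busy gpu):

```shell
# gpu-info is likely something like:
#   nvidia-smi --query-gpu=index,name,temperature.gpu,memory.used,memory.free,utilization.gpu,utilization.memory --format=csv,noheader
# gpu-process likely:
#   nvidia-smi --query-compute-apps=gpu_name,gpu_bus_id,pid,process_name --format=csv,noheader
# gpu-free can then filter gpu-info for idle gpus; demo on sample data:
info='0, GeForce RTX 2080 SUPER, 25, 126 MiB, 7855 MiB, 0 %, 0 %
1, GeForce RTX 2080 SUPER, 24, 11 MiB, 7971 MiB, 0 %, 0 %
2, GeForce RTX 2080 SUPER, 23, 11 MiB, 7971 MiB, 33 %, 5 %'
echo "$info" | awk -F', ' '$6 == "0 %" {ids = ids s $1; s = ","} END {print ids}'   # -> 0,1
```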
==== What We Purchased ====
* 12 nodes yielding a total of
* 24 cpus
* 288 cpu cores
* 1,152 GB cpu mem
* ~20 Tflops (dpfp)
* 48 gpus
* 384 GB gpu mem
* ~700 Tflops (mixed mode)
# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
nvcr.io/nvidia/cuda 10.1-devel 9e47e9dfcb9a 2 months ago 2.83GB
portainer/portainer latest ff4ee4caaa23 2 months ago 81.6MB
nvidia/cuda 9.2-devel 1874839f75d5 2 months ago 2.35GB
nvcr.io/nvidia/cuda 9.2-devel 1874839f75d5 2 months ago 2.35GB
nvcr.io/nvidia/cuda 10.0-devel f765411c4ae6 2 months ago 2.29GB
nvcr.io/nvidia/digits 19.09-tensorflow b08982c9545c 4 months ago 8.85GB
nvcr.io/nvidia/tensorflow 19.09-py2 b82bcb185286 4 months ago 7.88GB
nvcr.io/nvidia/pytorch 19.09-py3 9d6f9ccfbe31 5 months ago 9.15GB
nvcr.io/nvidia/caffe 19.09-py2 b52fbbef7e6b 5 months ago 5.15GB
nvcr.io/nvidia/rapidsai/rapidsai 0.9-cuda10.0-runtime-centos7 22b5dc2f7e84 5 months ago 5.84GB
# free -m
total used free shared buff/cache available
Mem: 95056 1919 85338 20 7798 92571
Swap: 10239 0 10239
# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.31 Driver Version: 440.31 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:3B:00.0 Off | N/A |
| 24% 24C P8 8W / 250W | 275MiB / 7981MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:5E:00.0 Off | N/A |
| 25% 24C P8 10W / 250W | 12MiB / 7982MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208... Off | 00000000:AF:00.0 Off | N/A |
| 24% 23C P8 4W / 250W | 12MiB / 7982MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... Off | 00000000:D8:00.0 Off | N/A |
| 25% 22C P8 13W / 250W | 12MiB / 7982MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3127 C python 115MiB |
| 0 5715 G /usr/bin/X 84MiB |
| 0 6307 G /usr/bin/gnome-shell 70MiB |
+-----------------------------------------------------------------------------+
# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 47G 0 47G 0% /dev
tmpfs 47G 0 47G 0% /dev/shm
tmpfs 47G 13M 47G 1% /run
tmpfs 47G 0 47G 0% /sys/fs/cgroup
/dev/nvme0n1p3 929G 42G 840G 5% /
/dev/nvme0n1p1 477M 199M 249M 45% /boot
overlay 929G 42G 840G 5% /var/lib/docker/overlay2
/6f7af00a8eb8b5ede68fd6bc9be5f7220525bdde21c14e6f1643a2a7debc454b/merged
overlay 929G 42G 840G 5% /var/lib/docker/overlay2
/9cf895d8a17106f16ba997e6025f5912abb988a512779caa2c35e2da3e7d196a/merged
tmpfs 9.3G 28K 9.3G 1% /run/user/0
# yum repolist
repo id repo name status
base/7/x86_64 CentOS-7 - Base 10,097
docker-ce-stable/x86_64 Docker CE Stable - x86_64 63
extras/7/x86_64 CentOS-7 - Extras 323
libnvidia-container libnvidia-container 65
nvidia-container-runtime nvidia-container-runtime 54
nvidia-docker nvidia-docker 50
updates/7/x86_64 CentOS-7 - Updates 1,117
==== Pictures ====
{{:cluster:ssd_small.JPG?nolink&300|}} Yea, found the 1TB SSD \\
{{:cluster:hdmi_small.JPG?nolink&300|}} ports on gpu \\
{{:cluster:gpu_small.JPG?nolink&300|}} GPU detail, blower model \\
{{:cluster:back_small.JPG?nolink&300|}} Back, gpus stacked 2 on 2 \\
{{:cluster:front_small.JPG?nolink&300|}} Front, all drive bays empty \\
{{:cluster:rack_small.JPG?nolink&300|}} Racking \\
{{:cluster:boxes_small.JPG?nolink&300|}} Boxes \\
\\
**[[cluster:0|Back]]**