This shows you the differences between two versions of the page.
cluster:192 [2020/02/17 19:46] hmeij07 [Pictures] |
cluster:192 [2022/03/08 18:02] hmeij07 [Recipe] |
||
---|---|---|---|
Line 4: | Line 4: | ||
===== EXX96 ===== | ===== EXX96 ===== | ||
- | A page for me on how these 12 nodes were build up after they arrived. To make them "ala n37" which as the test node in redoing our K20 nodes, see [[cluster: | + | A page for me on how these 12 nodes were built up after they arrived, to make them "ala n37", which was the test node in redoing our K20 nodes; see [[cluster: | ||
- | ==== WhatWeDo? ==== | + | Page best followed bottom to top if interested in the whole process. | ||
+ | |||
+ | The Usage section below is for HPCC users wanting to use queue '' | ||
+ | |||
+ | Debug for node n89 which turns itself off...grrhhh. Create a usb bootable stick with https:// | ||
+ | |||
+ | < | ||
+ | |||
+ | [root@n89 ~]# ipmitool sel elist | ||
+ | 1 | 02/29/2020 | 16:57:33 | Memory #0xd1 | Uncorrectable ECC | Asserted | ||
+ | 2 | 03/02/2020 | 03:02:42 | Processor CPU_CATERR | IERR | Asserted | ||
+ | 3 | 03/11/2020 | 19:27:35 | Processor CPU_CATERR | IERR | Asserted | ||
+ | ...[snip]... | ||
+ | |||
+ | [root@n89 ~]# ipmitool sdr elist | ||
+ | CPU1 Temperature | 31h | ok | 3.0 | 43 degrees C | ||
+ | CPU2 Temperature | 32h | ok | 0.0 | 40 degrees C | ||
+ | PSU1 Over Temp | 92h | ok | 0.0 | Transition to OK | ||
+ | PSU2 Over Temp | 9Ah | ok | 0.0 | Transition to OK | ||
+ | ...[snip]... | ||
+ | DIMMM1_Temp | ||
+ | CPU1_ECC1 | ||
+ | CPU2_ECC1 | ||
+ | ...[snip]... | ||
+ | PMBPower1 | ||
+ | PMBPower2 | ||
+ | ...[snip]... | ||
+ | FRNT_FAN1 | ||
+ | ../ | ||
+ | PSU1 Slow FAN1 | 95h | ok | 0.0 | Transition to OK | ||
+ | PSU2 Slow FAN1 | 9Dh | ok | 0.0 | Transition to OK | ||
+ | ...[snip]... | ||
+ | |||
+ | |||
+ | [root@n89 ~]# | ||
+ | # dmidecode 3.2 | ||
+ | Getting SMBIOS data from sysfs. | ||
+ | SMBIOS 3.2 present. | ||
+ | |||
+ | Handle 0x0000, DMI type 0, 26 bytes | ||
+ | BIOS Information | ||
+ | Vendor: American Megatrends Inc. | ||
+ | Version: 5102 | ||
+ | Release Date: 02/ | ||
+ | Address: 0xF0000 | ||
+ | Runtime Size: 64 kB | ||
+ | ROM Size: 32 MB | ||
+ | Characteristics: | ||
+ | ...[snip]... | ||
+ | UEFI is supported | ||
+ | BIOS Revision: 5.14 | ||
+ | |||
+ | |||
+ | [root@n89 ~]# edac-util -s -v | ||
+ | edac-util: EDAC drivers are loaded. 4 MCs detected: | ||
+ | mc0:Skylake Socket#0 IMC#0 | ||
+ | mc1:Skylake Socket#0 IMC#1 | ||
+ | mc2:Skylake Socket#1 IMC#0 | ||
+ | mc3:Skylake Socket#1 IMC#1 | ||
+ | [root@n89 ~]# edac-util | ||
+ | edac-util: No errors to report. | ||
+ | |||
+ | syslog | ||
+ | |||
+ | </ | ||
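When hunting for a pattern in the n89 shutdowns, it can help to count the SEL events per sensor and event type. A minimal sketch, assuming the pipe-separated layout shown in the `ipmitool sel elist` output above; the function name `sel_summary` is hypothetical and not part of ipmitool.

```shell
#!/bin/bash
# sel_summary (hypothetical helper): count SEL events per "sensor | event" pair.
# Reads `ipmitool sel elist` style lines on stdin, fields separated by '|'.
# Lines without at least 5 fields (e.g. "...[snip]...") are ignored.
sel_summary() {
  awk -F'|' '
    NF >= 5 {
      # trim surrounding blanks from the sensor (field 4) and event (field 5)
      gsub(/^[ \t]+|[ \t]+$/, "", $4)
      gsub(/^[ \t]+|[ \t]+$/, "", $5)
      count[$4 " | " $5]++
    }
    END {
      for (k in count) printf "%d %s\n", count[k], k
    }
  ' | sort -k1,1nr -k2
}

# usage on a node (needs root and ipmitool):
#   ipmitool sel elist | sel_summary
```

The most frequent event sorts to the top, so a recurring error stands out from one-off entries.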
+ | ==== Usage ==== | ||
+ | |||
+ | The new queue '' | ||
+ | |||
+ | A new static resource is introduced for all nodes holding gpus. '' | ||
+ | |||
+ | The wrappers provided assume your cpu:gpu ratio is 1:1 hence in your submit code you will have ''# | ||
+ | |||
+ | The wrappers (n78.mpich3.wrapper for '' | ||
+ | |||
+ | |||
+ | < | ||
+ | |||
+ | # command that shows gpu reservations | ||
+ | bhosts -l n79 | ||
+ | | ||
+ | | ||
+ | | ||
+ | |||
+ | # old way of doing that | ||
+ | lsload -l n79 | ||
+ | |||
+ | HOST_NAME | ||
+ | n79 | ||
+ | |||
+ | </ | ||
+ | |||
+ | Peer to peer communication is possible (via PCIe rather than NVlink) with this hardware. | ||
+ | |||
+ | < | ||
+ | cpu:gpu | ||
+ | mdout.325288: | ||
+ | mdout.325289: | ||
+ | mdout.326208: | ||
+ | |||
+ | </ | ||
+ | ==== Miscellaneous ==== | ||
+ | |||
+ | Install scheduler RPM for CentOS7, reconfigure (hosts, queue, static resource), elim. Test it out with old wrapper. | ||
+ | |||
+ | Edit the n37.openmpi.wrapper for n33-n37 and n79-90 and the one on n78 for the new static resource '' | ||
+ | |||
+ | Add nodes to ZenOSS hpcmon. | ||
+ | |||
+ | Propagate global '' | ||
+ | |||
+ | Look at how accounting ties in with resource request '' | ||
+ | |||
+ | < | ||
+ | |||
+ | # propagate global passwd, shadow, group, hosts file | ||
+ | |||
+ | # add to date_ctt2.sh script, get and set date | ||
+ | |||
+ | NOW=`/ | ||
+ | for i in `seq 79 90`; do echo n$i; ssh n$i date $NOW; done | ||
+ | |||
+ | # crontab | ||
+ | |||
+ | # ionice gaussian | ||
+ | 0,15,30,45 * * * * / | ||
+ | |||
+ | # cpu temps | ||
+ | 40 * * * * / | ||
+ | |||
+ | # rc.local, chmod o+x / | ||
+ | |||
+ | # for mapd, 'All On' enable graphics rendering support | ||
+ | #/ | ||
+ | |||
+ | # for amber16 -pm=1/ | ||
+ | #nvidia-smi --persistence-mode=1 | ||
+ | #nvidia-smi --compute-mode=1 | ||
+ | |||
+ | # for mwgpu/exx96 -pm=1/ | ||
+ | # note: turned this off, running with defaults | ||
+ | # seems stable, maybe persistence later on | ||
+ | # let's see how docker interacts first... | ||
+ | #nvidia-smi --persistence-mode=1 | ||
+ | #nvidia-smi --compute-mode=0 | ||
+ | |||
+ | # turn ECC off (memory scrubbing) | ||
+ | #/ | ||
+ | |||
+ | # lm_sensor | ||
+ | modprobe coretemp | ||
+ | modprobe tmp401 | ||
+ | #modprobe w83627ehf | ||
+ | |||
+ | reboot | ||
+ | |||
+ | </ | ||
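The `seq 79 90` loop above is the pattern for any per-node task on these 12 nodes. A small sketch that wraps it with a dry-run mode, so a command can be previewed before it touches every node; `on_nodes` and the `DRYRUN` variable are hypothetical names, and passwordless ssh as root is assumed.

```shell
#!/bin/bash
# on_nodes FIRST LAST CMD... (hypothetical helper)
# Runs CMD on nodes nFIRST..nLAST over ssh, one node at a time.
# With DRYRUN=1 the ssh commands are printed instead of executed.
on_nodes() {
  local first=$1 last=$2 i
  shift 2
  for i in $(seq "$first" "$last"); do
    if [ "${DRYRUN:-0}" = "1" ]; then
      echo "ssh n$i $*"
    else
      echo "n$i"
      ssh "n$i" "$@"
    fi
  done
}

# preview first, then run for real:
#   DRYRUN=1 on_nodes 79 90 date
#   on_nodes 79 90 date
```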
+ | |||
+ | ==== Recipe ==== | ||
Steps. "Ala n37" ... so the RTX nodes are similar to the K20 nodes and we can put the local software in place. See [[cluster: | Steps. "Ala n37" ... so the RTX nodes are similar to the K20 nodes and we can put the local software in place. See [[cluster: | ||
Line 12: | Line 170: | ||
< | < | ||
- | yum install epel-release | + | # hook up DVI-D cable to GPU port (offboard video) | ||
- | yum install tcl tcl-devel dmtcp | + | # login as root check some things out... |
- | yum install freeglut-devel libXi-devel libXmu-devel \ make mesa-libGLU-devel | + | free -g |
- | yum install blas blas-devel lapack lapack-devel boost boost-devel | + | nvidia-smi |
- | yum install tkinter lm_sensors lm_sensors-libs | + | docker images |
- | yum install zlib-devel bzip2-devel bzip bzip-devel | + | docker ps |
- | yum install openmpi openmpi-devel perl-ExtUtils-MakeMaker | + | # set local time zone |
- | yum install cmake cmake-devel | + | mv / |
- | yum install libjpeg libjpeg-devel libjpeg-turbo-devel | + | ln -s / |
+ | # change passwords for root and vendor account | ||
+ | passwd | ||
+ | passwd exx | ||
+ | # set hostname | ||
+ | hostnamectl set-hostname n79 | ||
+ | # configure private subnets and ping file server | ||
+ | cd / | ||
+ | vi ifcfg-enp1s0f0 | ||
+ | vi ifcfg-enp1s0f1 | ||
+ | systemctl restart network | ||
+ | ping -c 3 192.168.102.42 | ||
+ | ping -c 3 10.10.102.42 | ||
+ | # make internet connection for yum | ||
+ | ifdown enp1s0f0 | ||
+ | vi ifcfg-enp1s0f0 | ||
+ | systemctl restart network | ||
+ | dig google.com | ||
+ | #centos7 | ||
+ | yum install -y iptables-services | ||
+ | vi / | ||
+ | systemctl start iptables | ||
+ | iptables -L | ||
+ | systemctl stop firewalld | ||
+ | systemctl disable firewalld | ||
+ | # centos8 | ||
+ | # allow ports 0:65526 for priv networks | ||
+ | |||
+ | # other configs | ||
+ | vi / | ||
+ | mv /home / | ||
+ | mkdir /home | ||
+ | vi /etc/passwd (exx, dockeruser $HOME) | ||
+ | mkdir /sanscratch / | ||
+ | chmod ugo+rwx /sanscratch / | ||
+ | chmod o+t /sanscratch / | ||
+ | ln -s /home /share | ||
+ | ssh-keygen -t rsa | ||
+ | scp 10.10.102.253:/ | ||
+ | / | ||
+ | echo " | ||
+ | |||
+ | # add packages | ||
+ | xmgrace xmgrace-devel, | ||
+ | # add packages and update | ||
+ | yum install epel-release | ||
+ | yum install flex flex-devel bison bison-devel -y | ||
+ | yum install tcl tcl-devel dmtcp -y | ||
+ | yum install net-snmp net-snmp-libs net-snmp-agent-libs net-tools net-snmp-utils -y | ||
+ | yum install freeglut-devel libXi-devel libXmu-devel make mesa-libGLU-devel | ||
+ | yum install blas blas-devel lapack lapack-devel boost boost-devel | ||
+ | yum install tkinter lm_sensors lm_sensors-libs | ||
+ | yum install zlib-devel bzip2-devel bzip bzip-devel | ||
+ | yum install openmpi openmpi-devel perl-ExtUtils-MakeMaker | ||
+ | yum install cmake cmake-devel | ||
+ | yum install libjpeg libjpeg-devel libjpeg-turbo-devel | ||
+ | # amber | ||
+ | yum -y install tcsh make \ | ||
+ | gcc gcc-gfortran gcc-c++ \ | ||
+ | which flex bison patch bc \ | ||
+ | | ||
+ | perl perl-ExtUtils-MakeMaker util-linux wget \ | ||
+ | bzip2 bzip2-devel zlib-devel tar | ||
+ | yum update -y | ||
yum clean all | yum clean all | ||
+ | # remove internet, bring private back up | ||
+ | ifdown enp1s0f0 | ||
+ | vi ifcfg-enp1s0f0 | ||
+ | ifup enp1s0f0 | ||
+ | # passwd, shadow, group, hosts, fstab | ||
+ | mkdir /homeextra1 /homeextra2 /home33 /mindstore | ||
+ | cd /etc/ | ||
+ | # backup files to -orig versions | ||
+ | scp 192.168.102.89:/ | ||
+ | scp 10.10.102.89:/ | ||
+ | vi /etc/fstab | ||
+ | mount -a; df -h | ||
+ | # pick the kernel vendor used for now | ||
+ | grep ^menuentry / | ||
+ | grub2-set-default 1 | ||
+ | ls -d / | ||
+ | grub2-mkconfig -o / | ||
+ | # | ||
+ | # old level 3 | ||
+ | systemctl set-default multi-user.target | ||
+ | reboot | ||
+ | # switch to VGA | ||
+ | cd / | ||
+ | tar zxf n37.chroot-keep.ul.tar.gz | ||
+ | cd usr/local/ | ||
+ | mv amber16/ | ||
+ | mv cuda-9.2/ / | ||
+ | cd / | ||
+ | rsync -vac 10.10.102.89:/ | ||
+ | # test scripts gpu-free, gpu-info, gpu-process | ||
+ | 0,1,2,3 | ||
+ | id, | ||
+ | 0, GeForce RTX 2080 SUPER, 25, 126 MiB, 7855 MiB, 0 %, 0 % | ||
+ | 1, GeForce RTX 2080 SUPER, 24, 11 MiB, 7971 MiB, 0 %, 0 % | ||
+ | 2, GeForce RTX 2080 SUPER, 23, 11 MiB, 7971 MiB, 0 %, 0 % | ||
+ | 3, GeForce RTX 2080 SUPER, 23, 11 MiB, 7971 MiB, 0 %, 0 % | ||
+ | gpu_name, gpu_bus_id, pid, process_name | ||
+ | GeForce RTX 2080 SUPER, 00000000: | ||
+ | # done | ||
</ | </ | ||
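The gpu-free script prints a comma-separated list of ids (`0,1,2,3` above), while gpu-info prints one line per GPU. A sketch of how such a list could be derived from gpu-info style lines. The column meaning is an assumption: the header line is truncated on this page, and the sixth field is taken to be GPU utilization. `idle_gpus` is a hypothetical name, not one of the site scripts.

```shell
#!/bin/bash
# idle_gpus (hypothetical helper): print ids of GPUs whose utilization is 0 %.
# Reads comma-separated lines on stdin; field 1 is the gpu id and field 6 is
# ASSUMED to be gpu utilization (the real header is truncated on this page).
# Header lines and anything not starting with a numeric id are skipped.
idle_gpus() {
  awk -F', *' '
    $1 ~ /^[0-9]+$/ && NF >= 6 {
      util = $6
      sub(/ *%.*$/, "", util)      # "0 %" -> "0"
      if (util + 0 == 0) print $1
    }
  ' | paste -sd, -
}

# usage, assuming gpu-info is on the PATH:
#   gpu-info | idle_gpus
```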
- | ==== WhatWeGot? ==== | + | ==== What We Purchased ==== | ||
* 12 nodes yielding a total of | * 12 nodes yielding a total of | ||
Line 52: | Line 312: | ||
nvcr.io/ | nvcr.io/ | ||
- | # free -g | + | free -m |
total used free shared | total used free shared | ||
- | Mem: 92 | + | Mem: |
+ | Swap: | ||
# nvidia-smi | # nvidia-smi | ||
Line 117: | Line 379: | ||
{{: | {{: | ||
- | {{: | + | {{: |
{{: | {{: | ||
{{: | {{: |