This shows you the differences between two versions of the page.
cluster:192 [2020/02/21 15:50] hmeij07 |
cluster:192 [2022/02/10 15:49] hmeij07 |
||
---|---|---|---|
Line 4: | Line 4: | ||
===== EXX96 ===== | ===== EXX96 ===== | ||
- | A page for me on how these 12 nodes were build up after they arrived. To make them "ala n37" which as the test node in redoing our K20 nodes, see [[cluster: | + | A page for me on how these 12 nodes were built up after they arrived. To make them "ala n37" which was the test node in redoing our K20 nodes, see [[cluster: |
- | Page best read bottom to top. | + | Page best followed |
+ | |||
+ | The Usage section below is for HPCC users wanting to use queue '' | ||
+ | |||
+ | Debug for node n89 which turns itself off...grrhhh. Create a usb bootable stick with https:// | ||
+ | |||
+ | < | ||
+ | |||
+ | [root@n89 ~]# ipmitool sel elist | ||
+ | 1 | 02/29/2020 | 16:57:33 | Memory #0xd1 | Uncorrectable ECC | Asserted | ||
+ | 2 | 03/02/2020 | 03:02:42 | Processor CPU_CATERR | IERR | Asserted | ||
+ | 3 | 03/11/2020 | 19:27:35 | Processor CPU_CATERR | IERR | Asserted | ||
+ | ...[snip]... | ||
+ | |||
+ | [root@n89 ~]# ipmitool sdr elist | ||
+ | CPU1 Temperature | 31h | ok | 3.0 | 43 degrees C | ||
+ | CPU2 Temperature | 32h | ok | 0.0 | 40 degrees C | ||
+ | PSU1 Over Temp | 92h | ok | 0.0 | Transition to OK | ||
+ | PSU2 Over Temp | 9Ah | ok | 0.0 | Transition to OK | ||
+ | ...[snip]... | ||
+ | DIMMM1_Temp | ||
+ | CPU1_ECC1 | ||
+ | CPU2_ECC1 | ||
+ | ...[snip]... | ||
+ | PMBPower1 | ||
+ | PMBPower2 | ||
+ | ...[snip]... | ||
+ | FRNT_FAN1 | ||
+ | ../ | ||
+ | PSU1 Slow FAN1 | 95h | ok | 0.0 | Transition to OK | ||
+ | PSU2 Slow FAN1 | 9Dh | ok | 0.0 | Transition to OK | ||
+ | ...[snip]... | ||
+ | |||
+ | |||
+ | [root@n89 ~]# | ||
+ | # dmidecode 3.2 | ||
+ | Getting SMBIOS data from sysfs. | ||
+ | SMBIOS 3.2 present. | ||
+ | |||
+ | Handle 0x0000, DMI type 0, 26 bytes | ||
+ | BIOS Information | ||
+ | Vendor: American Megatrends Inc. | ||
+ | Version: 5102 | ||
+ | Release Date: 02/ | ||
+ | Address: 0xF0000 | ||
+ | Runtime Size: 64 kB | ||
+ | ROM Size: 32 MB | ||
+ | Characteristics: | ||
+ | ...[snip]... | ||
+ | UEFI is supported | ||
+ | BIOS Revision: 5.14 | ||
+ | |||
+ | |||
+ | [root@n89 ~]# edac-util -s -v | ||
+ | edac-util: EDAC drivers are loaded. 4 MCs detected: | ||
+ | mc0:Skylake Socket#0 IMC#0 | ||
+ | mc1:Skylake Socket#0 IMC#1 | ||
+ | mc2:Skylake Socket#1 IMC#0 | ||
+ | mc3:Skylake Socket#1 IMC#1 | ||
+ | [root@n89 ~]# edac-util | ||
+ | edac-util: No errors to report. | ||
+ | |||
+ | syslog | ||
+ | |||
+ | </ | ||
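When a node logs many events, the `ipmitool sel elist` listing above can be easier to read as counts per sensor and event type. The following is a minimal sketch, not part of the original notes; the function name `summarize_sel` is hypothetical, and it assumes the `|`-separated column layout shown in the listing.

```bash
#!/usr/bin/env bash
# Summarize "ipmitool sel elist" text read from stdin.
# Assumes columns: id | date | time | sensor | event | state
# Lines without at least five "|"-separated fields (e.g. "...[snip]...") are skipped.
summarize_sel() {
    awk -F'|' '
        NF >= 5 {
            sensor = $4; event = $5
            # trim leading and trailing blanks from both fields
            gsub(/^[ \t]+|[ \t]+$/, "", sensor)
            gsub(/^[ \t]+|[ \t]+$/, "", event)
            count[sensor " / " event]++
        }
        END {
            for (key in count) printf "%d\t%s\n", count[key], key
        }
    ' | sort -k1,1nr -k2
}
```

On a node this could be used as `ipmitool sel elist | summarize_sel`, which prints one line per sensor/event pair with the most frequent first.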
+ | ==== Usage ==== | ||
+ | |||
+ | The new queue '' | ||
+ | |||
+ | A new static resource is introduced for all nodes holding gpus. '' | ||
+ | |||
+ | The wrappers provided assume your cpu:gpu ratio is 1:1 hence in your submit code you will have ''# | ||
+ | |||
+ | The wrappers (n78.mpich3.wrapper for '' | ||
+ | |||
+ | |||
+ | < | ||
+ | |||
+ | # command that shows gpu reservations | ||
+ | bhosts -l n79 | ||
+ | | ||
+ | | ||
+ | | ||
+ | |||
+ | # old way of doing that | ||
+ | lsload -l n79 | ||
+ | |||
+ | HOST_NAME | ||
+ | n79 | ||
+ | |||
+ | </ | ||
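The 1:1 cpu:gpu assumption described in this section can be illustrated with a small helper that prints matching submit directives. This is only a sketch: the function name `gpu_submit_header` and its arguments are hypothetical, the queue and resource names are placeholders rather than the values used on this cluster, and whether an `rusage` reservation counts per job or per slot depends on the scheduler configuration.

```bash
#!/usr/bin/env bash
# Print LSF-style "#BSUB" directives for a job that asks for as many
# cpu slots as gpus (the 1:1 ratio the wrappers assume).
# Arguments (placeholders, not taken from this page):
#   $1 queue name, $2 static gpu resource name, $3 number of gpus
gpu_submit_header() {
    local queue="$1" resource="$2" ngpu="$3"
    case "$ngpu" in
        ''|*[!0-9]*|0)
            echo "ngpu must be a positive integer" >&2
            return 1
            ;;
    esac
    printf '#BSUB -q %s\n' "$queue"
    printf '#BSUB -n %s\n' "$ngpu"
    printf '#BSUB -R "rusage[%s=%s]"\n' "$resource" "$ngpu"
}
```

For example, `gpu_submit_header myqueue mygpu 2` prints three directive lines that request two slots and two units of the placeholder resource `mygpu`.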
+ | |||
+ | Peer to peer communication is possible (via PCIe rather than NVlink) with this hardware. | ||
+ | |||
+ | < | ||
+ | cpu:gpu | ||
+ | mdout.325288: | ||
+ | mdout.325289: | ||
+ | mdout.326208: | ||
+ | |||
+ | </ | ||
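The PCIe versus NVLink distinction mentioned above is visible in the connection matrix printed by `nvidia-smi topo -m`. As a rough sketch (the function name `classify_link` is hypothetical), the cell labels from that matrix can be mapped to a link family. Note that `SYS` and `NODE` paths also cross the CPU interconnect or host bridges, so grouping them under "pcie" is a simplification.

```bash
#!/usr/bin/env bash
# Map one cell label from the "nvidia-smi topo -m" matrix to a link family.
#   NV#                -> nvlink (bonded set of # NVLinks)
#   PIX PXB PHB NODE SYS -> pcie   (paths that traverse PCIe)
#   X                  -> self
classify_link() {
    case "$1" in
        NV[0-9]*)             echo "nvlink" ;;
        PIX|PXB|PHB|NODE|SYS) echo "pcie" ;;
        X)                    echo "self" ;;
        *)                    echo "unknown" ;;
    esac
}
```

Running `nvidia-smi topo -m` on a node and passing a GPU-to-GPU cell through `classify_link` should report `pcie` on hardware without NVLink bridges.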
+ | ==== Miscellaneous ==== | ||
+ | |||
+ | Install scheduler RPM for CentOS7, reconfigure (hosts, queue, static resource), elim. Test it out with old wrapper. | ||
+ | |||
+ | Edit the n37.openmpi.wrapper for n33-n37 and n79-90 and the one on n78 for the new static resource '' | ||
+ | |||
+ | Add nodes to ZenOSS hpcmon. | ||
+ | |||
+ | Propagate global '' | ||
+ | |||
+ | Look at how accounting ties in with resource request '' | ||
+ | |||
+ | < | ||
+ | |||
+ | # propagate global passwd, shadow, group, hosts file | ||
+ | |||
+ | # add to date_ctt2.sh script, get and set date | ||
+ | |||
+ | NOW=`/ | ||
+ | for i in `seq 79 90`; do echo n$i; ssh n$i date $NOW; done | ||
+ | |||
+ | # crontab | ||
+ | |||
+ | # ionice gaussian | ||
+ | 0,15,30,45 * * * * / | ||
+ | |||
+ | # cpu temps | ||
+ | 40 * * * * / | ||
+ | |||
+ | # rc.local, chmod o+x / | ||
+ | |||
+ | # for mapd, 'All On' enable graphics rendering support | ||
+ | #/ | ||
+ | |||
+ | # for amber16 -pm=1/ | ||
+ | #nvidia-smi --persistence-mode=1 | ||
+ | #nvidia-smi --compute-mode=1 | ||
+ | |||
+ | # for mwgpu/exx96 -pm=1/ | ||
+ | # note: turned this off, running with defaults | ||
+ | # seems stable, maybe persistence later on | ||
+ | # let's see how docker interacts first... | ||
+ | #nvidia-smi --persistence-mode=1 | ||
+ | #nvidia-smi --compute-mode=0 | ||
+ | |||
+ | # turn ECC off (memory scrubbing) | ||
+ | #/ | ||
+ | |||
+ | # lm_sensor | ||
+ | modprobe coretemp | ||
+ | modprobe tmp401 | ||
+ | #modprobe w83627ehf | ||
+ | |||
+ | reboot | ||
+ | |||
+ | </ | ||
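The `for i in $(seq 79 90)` loop used above for setting the date generalizes to any per-node command. A minimal sketch follows; the function names `node_names` and `run_on_nodes` and the `DRY_RUN` variable are hypothetical and not part of the original scripts.

```bash
#!/usr/bin/env bash
# Print node names n<first> .. n<last>, one per line.
node_names() {
    local first="$1" last="$2" i
    for i in $(seq "$first" "$last"); do
        echo "n$i"
    done
}

# Run a command on each node in the range via ssh.
# With DRY_RUN=1 the ssh command is only printed, not executed.
run_on_nodes() {
    local first="$1" last="$2"
    shift 2
    local node
    for node in $(node_names "$first" "$last"); do
        echo "== $node"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ssh $node $*"
        else
            # -n keeps ssh from consuming the caller's stdin
            ssh -n "$node" "$@"
        fi
    done
}
```

For example, `DRY_RUN=1 run_on_nodes 79 90 date` lists the twelve ssh commands without contacting any node; dropping `DRY_RUN=1` runs them.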
==== Recipe ==== | ==== Recipe ==== | ||
Line 61: | Line 217: | ||
# add packages and update | # add packages and update | ||
yum install epel-release -y | yum install epel-release -y | ||
+ | yum install flex flex-devel bison bison-devel -y | ||
yum install tcl tcl-devel dmtcp -y | yum install tcl tcl-devel dmtcp -y | ||
+ | yum install net-snmp net-snmp-libs net-snmp-agent-libs net-tools net-snmp-utils -y | ||
yum install freeglut-devel libXi-devel libXmu-devel make mesa-libGLU-devel -y | yum install freeglut-devel libXi-devel libXmu-devel make mesa-libGLU-devel -y | ||
yum install blas blas-devel lapack lapack-devel boost boost-devel -y | yum install blas blas-devel lapack lapack-devel boost boost-devel -y | ||
Line 69: | Line 227: | ||
yum install cmake cmake-devel -y | yum install cmake cmake-devel -y | ||
yum install libjpeg libjpeg-devel libjpeg-turbo-devel -y | yum install libjpeg libjpeg-devel libjpeg-turbo-devel -y | ||
+ | # amber | ||
+ | yum -y install tcsh make \ | ||
+ | gcc gcc-gfortran gcc-c++ \ | ||
+ | which flex bison patch bc \ | ||
+ | | ||
+ | perl perl-ExtUtils-MakeMaker util-linux wget \ | ||
+ | bzip2 bzip2-devel zlib-devel tar | ||
yum update -y | yum update -y | ||
yum clean all | yum clean all | ||
Line 140: | Line 305: | ||
nvcr.io/ | nvcr.io/ | ||
- | # free -g | + | free -m |
total used free shared | total used free shared | ||
- | Mem: 92 | + | Mem: |
+ | Swap: | ||
# nvidia-smi | # nvidia-smi | ||
Line 205: | Line 372: | ||
{{: | {{: | ||
- | {{: | + | {{: |
{{: | {{: | ||
{{: | {{: |