cluster:192 [2020/02/24 19:21] hmeij07
cluster:192 [2020/04/03 13:10] hmeij07 [EXX96]
===== EXX96 =====
A page for me on how these 12 nodes were built up after they arrived. To make them "ala n37" which was the test node in redoing our K20 nodes, see [[cluster:

Page best followed bottom to top.

The Usage section below is for HPCC users wanting to use queue ''
+ | |||
+ | Debug for node n89 which turns itself off...grrhhh | ||
+ | |||
+ | < | ||
+ | |||
+ | ipmitool sel elist | ||
+ | ipmitool sdr elist | ||
+ | dmidecode -t0 | ||
+ | |||
+ | edac-util: EDAC drivers are loaded. 4 MCs detected: | ||
+ | mc0:Skylake Socket#0 IMC#0 | ||
+ | mc1:Skylake Socket#0 IMC#1 | ||
+ | mc2:Skylake Socket#1 IMC#0 | ||
+ | mc3:Skylake Socket#1 IMC#1 | ||
+ | [root@n89 ~]# edac-util | ||
+ | edac-util: No errors to report. | ||
+ | |||
+ | syslog | ||
+ | |||
+ | </ | ||
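The SEL check above can be wrapped in a small filter so only power and thermal events stand out when hunting for why a node shuts itself off. A minimal sketch; the sample event line is illustrative, not captured from the real n89.

```shell
# Sketch: filter "ipmitool sel elist" output for power/thermal events.
# Reads stdin, so it also works on a saved copy of the event log.
scan_sel() {
    grep -iE 'power|shutdown|thermal|temperature' || echo "no power/thermal events in SEL"
}

# e.g. on the node:  ipmitool sel elist | scan_sel
# illustrative sample line (not from a real node):
printf '%s\n' \
  '   1 | 02/20/2020 | 03:14:07 | Power Unit #0x01 | Power off/down | Asserted' \
  | scan_sel
```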
==== Usage ====
+ | |||
+ | The new queue '' | ||
+ | |||
+ | A new static resource is introduced for all nodes holding gpus. '' | ||
+ | |||
+ | The wrappers provided assume your cpu:gpu ratio is 1:1 hence in your submit code you will have ''# | ||
+ | |||
+ | The wrappers (n78.mpich3.wrapper for '' | ||
+ | |||
+ | |||
<code>
# command that shows gpu reservations
bhosts -l n79

# old way of doing that
lsload -l n79

HOST_NAME
n79
</code>
Peer to peer communication is possible (via PCIe rather than NVLink) with this hardware.
<code>
cpu:gpu
mdout.325288:
mdout.325289:
mdout.326208:
</code>
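One way to confirm the peer-to-peer path is PCIe rather than NVLink is to look at the inter-GPU topology matrix on a node: entries like PIX/PXB/PHB indicate PCIe hops, NV# would indicate NVLink. A small sketch that degrades gracefully when run off a GPU node:

```shell
# Sketch: print the inter-GPU connection matrix on a GPU node.
# Falls back to a message where the NVIDIA tools are not installed.
show_gpu_topology() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi topo -m
    else
        echo "nvidia-smi not found; run this on a GPU node"
    fi
}

show_gpu_topology
```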
==== Miscellaneous ====
Install
Edit the n37.openmpi.wrapper for n33-n37 and n79-90 and the one on n78 for the new static resource ''
Add nodes to ZenOSS hpcmon.
+ | |||
+ | Propagate global '' | ||
+ | |||
+ | Look at how accounting ties in with resource request '' | ||
<code>
#/
# for amber16 -pm=1/ENABLED -c=1/EXCLUSIVE_PROCESS
#nvidia-smi --persistence-mode=1
#nvidia-smi --compute-mode=1
# for mwgpu/exx96 -pm=1/ENABLED -c=0/DEFAULT
# note: turned this off, running with defaults
# seems stable, maybe persistence later on
# lets see how docker interacts first...
#nvidia-smi --persistence-mode=1
#nvidia-smi --compute-mode=0
# turn ECC off (memory scrubbing)
# add packages and update
yum install epel-release -y
yum install flex flex-devel bison bison-devel -y
yum install tcl tcl-devel dmtcp -y
yum install net-snmp net-snmp-libs net-snmp-agent-libs net-tools net-snmp-utils -y
yum install freeglut-devel libXi-devel libXmu-devel make mesa-libGLU-devel -y
yum install blas blas-devel lapack lapack-devel boost boost-devel -y
nvcr.io/
free -m
              total        used        free      shared
Mem:
Swap:
# nvidia-smi