cluster:192

Differences

This shows you the differences between two versions of the page.

cluster:192 [2020/02/26 13:25]
hmeij07 [EXX96]
cluster:192 [2022/03/08 13:29] (current)
hmeij07 [Recipe]
Line 4: Line 4:
 ===== EXX96 ===== ===== EXX96 =====
  
-A page for me on how these 12 nodes were built up after they arrived. To make them "ala n37" which as the test node in redoing our K20 nodes, see [[cluster:172|K20 Redo]]+A page for me on how these 12 nodes were built up after they arrived. To make them "ala n37" which was the test node in redoing our K20 nodes, see [[cluster:172|K20 Redo]] and [[cluster:173|K20 Redo Usage]]
  
-Page best followed bottom to top.+Page best followed bottom to top if interested in the whole process.
  
 +The Usage section below is for HPCC users wanting to use queue ''exx96''.
 +
 +Debug for node n89 which turns itself off...grrhhh. Create a usb bootable stick with https://rufus.ie/ then unzip BIOS and firmware zip files located in ''n89:/usr/local/src''
 +
 +<code>
 +
 +[root@n89 ~]# ipmitool sel elist
 +   1 | 02/29/2020 | 16:57:33 | Memory #0xd1 | Uncorrectable ECC | Asserted
 +   2 | 03/02/2020 | 03:02:42 | Processor CPU_CATERR | IERR | Asserted
 +   3 | 03/11/2020 | 19:27:35 | Processor CPU_CATERR | IERR | Asserted
 +...[snip]...
 +
 +[root@n89 ~]# ipmitool sdr elist
 +CPU1 Temperature | 31h | ok  |  3.0 | 43 degrees C
 +CPU2 Temperature | 32h | ok  |  0.0 | 40 degrees C
 +PSU1 Over Temp   | 92h | ok  |  0.0 | Transition to OK
 +PSU2 Over Temp   | 9Ah | ok  |  0.0 | Transition to OK
 +...[snip]...
 +DIMMM1_Temp      | E4h | ok  |  3.0 | 28 degrees C
 +CPU1_ECC1        | D1h | ok  |  0.0 | Presence Detected
 +CPU2_ECC1        | D3h | ok  |  0.0 | Presence Detected
 +...[snip]...
 +PMBPower1        | E1h | ok  |  3.0 | 88 Watts
 +PMBPower2        | E2h | ok  |  3.0 | 112 Watts
 +...[snip]...
 +FRNT_FAN1        | A2h | ok  |  0.0 | 3100 RPM
 +...[snip]...
 +PSU1 Slow FAN1   | 95h | ok  |  0.0 | Transition to OK
 +PSU2 Slow FAN1   | 9Dh | ok  |  0.0 | Transition to OK
 +...[snip]...
 +
 +
 +[root@n89 ~]# dmidecode -t0
 +# dmidecode 3.2
 +Getting SMBIOS data from sysfs.
 +SMBIOS 3.2 present.
 +
 +Handle 0x0000, DMI type 0, 26 bytes
 +BIOS Information
 +        Vendor: American Megatrends Inc.
 +        Version: 5102
 +        Release Date: 02/11/2019
 +        Address: 0xF0000
 +        Runtime Size: 64 kB
 +        ROM Size: 32 MB
 +        Characteristics:
 +...[snip]...
 +                UEFI is supported
 +        BIOS Revision: 5.14
 +
 +
 +[root@n89 ~]# edac-util -s -v
 +edac-util: EDAC drivers are loaded. 4 MCs detected:
 +  mc0:Skylake Socket#0 IMC#0
 +  mc1:Skylake Socket#0 IMC#1
 +  mc2:Skylake Socket#1 IMC#0
 +  mc3:Skylake Socket#1 IMC#1
 +[root@n89 ~]# edac-util
 +edac-util: No errors to report.
 +
 +syslog
 +
 +</code>
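
When the event log on a flaky node grows, a per-sensor tally makes repeat offenders easier to spot. The following is a minimal sketch, assuming the pipe-delimited layout of ''ipmitool sel elist'' shown above (id, date, time, sensor, event, state). ''sel_summary'' is a hypothetical helper name, not something installed on the nodes.

```shell
#!/usr/bin/env bash
# sel_summary: hypothetical helper (an assumption, not part of ipmitool).
# Reads `ipmitool sel elist` style text on stdin and prints one line per
# sensor/event pair, prefixed by how often it occurred, most frequent first.
sel_summary() {
  awk -F'|' 'NF >= 6 {
      sensor = $4; event = $5
      gsub(/^ +| +$/, "", sensor)   # trim padding around the sensor field
      gsub(/^ +| +$/, "", event)    # trim padding around the event field
      count[sensor " / " event]++
    }
    END { for (k in count) printf "%d %s\n", count[k], k }' |
    sort -k1,1nr -k2
}

# Demo using the three events listed above
sel_summary <<'EOF'
   1 | 02/29/2020 | 16:57:33 | Memory #0xd1 | Uncorrectable ECC | Asserted
   2 | 03/02/2020 | 03:02:42 | Processor CPU_CATERR | IERR | Asserted
   3 | 03/11/2020 | 19:27:35 | Processor CPU_CATERR | IERR | Asserted
EOF
```

On a live node the same function would be fed directly: ''ipmitool sel elist | sel_summary''.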
 ==== Usage ==== ==== Usage ====
  
-The new queue ''exx96'' will be comprised of nodes ''n79-n90''. Each node holds 4x RTX2080S gpus, 2x Xeon Silver 4214 2.2 GHz cpus, 96 GB memory and a 1TB SSD. ''/localscratch'' is around 800 GD.+The new queue ''exx96'' will be comprised of nodes ''n79-n90''. Each node holds 4x RTX2080S gpus, 2x Xeon Silver 4214 2.2 GHz 12-core cpus, 96 GB memory and a 1TB SSD. ''/localscratch'' is around 800 GB.
  
 +A new static resource, ''gpu4'', is introduced for all nodes holding gpus: ''n78'' in queue ''amber128'', ''n33-n37'' in queue ''mwgpu'', and the nodes mentioned above. Moving forward please use it instead of ''gpu'' or ''gputest''.
  
 +The wrappers provided assume your cpu:gpu ratio is 1:1, hence your submit script will have ''#BSUB -n 1'' and, in your resource allocation line, ''gpu4=1''. If your ratio is something else you can set CPU_GPU_REQUEST. For example, CPU_GPU_REQUEST=4:2 expects the lines ''#BSUB -n 4'' and ''gpu4=2'' in your submit script. A sample script is at ''/home/hmeij/k20redo/run.rtx''.
 +
 +The wrappers (n78.mpich3.wrapper for ''n78'', and n37.openmpi.wrapper for all others) are located in ''/usr/local/bin''. They set up your environment and start one of these applications from ''/usr/local'': amber, lammps, gromacs, matlab or namd.
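
To make the CPU_GPU_REQUEST convention concrete, here is a small sketch that turns a ratio into the two submit-script lines it implies. ''gpu_request_lines'' is a hypothetical helper written for illustration; it is not part of the wrappers. The ''rusage'' string is trimmed to the ''gpu4'' part only, so any memory or span settings your job needs are left to you.

```shell
#!/usr/bin/env bash
# gpu_request_lines: hypothetical helper (an assumption, not shipped with the wrappers).
# Given a cpu:gpu ratio such as 4:2, print the matching "#BSUB -n" line
# and a resource line that reserves that many gpu4 slots.
gpu_request_lines() {
  local req=${1:-1:1}                 # default to the 1:1 ratio the wrappers assume
  case $req in
    *[!0-9:]*|*:*:*|:*|*:) echo "bad ratio: $req" >&2; return 1 ;;
    *:*) ;;                           # exactly one colon with digits on both sides
    *) echo "bad ratio: $req" >&2; return 1 ;;
  esac
  local cpus=${req%%:*}
  local gpus=${req##*:}
  printf '#BSUB -n %s\n' "$cpus"
  printf '#BSUB -R "rusage[gpu4=%s]"\n' "$gpus"
}

# Demo: the 4:2 example from the text
gpu_request_lines 4:2
```

A submit script using a 4:2 ratio would carry these two lines and also ''export CPU_GPU_REQUEST=4:2'' before calling the wrapper.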
 + 
  
 <code> <code>
 +
 +# command that shows gpu reservations
 bhosts -l n79 bhosts -l n79
              gputest gpu4              gputest gpu4
  Total                3  Total                3
- Reserved        0.0  0.1+ Reserved        0.0  1.0
  
 +# old way of doing that
 lsload -l n79 lsload -l n79
  
Line 25: Line 96:
 n79                         ok   0.0   0.0   0.0   0%   0.0       0 2e+08  826G   10G   90G    3.0 n79                         ok   0.0   0.0   0.0   0%   0.0       0 2e+08  826G   10G   90G    3.0
  
-mdout.325288: Master Total CPU time:          982.60 seconds     0.27 hours  1:1 +</code>
-mdout.325289: Master Total CPU time:          611.08 seconds     0.17 hours  4:2 +
-mdout.326208: Master Total CPU time:          537.97 seconds     0.15 hours 36:4+
  
-#BSUB -n 4 +Peer-to-peer communication is possible with this hardware (via PCIe rather than NVLink), but setting it up gets rather messy. Some quick, off-the-cuff performance data shows some impact, but generally in our environment the gains are not worth the effort. The timings below were measured using Amber and ''pmemd.cuda.MPI''.
-#BSUB -R "rusage[gpu4=2:mem=6288],span[hosts=1]" +
-export CPU_GPU_REQUEST=4:2+
  
-</code>+<code> 
 +                                                                              cpu:gpu 
 +mdout.325288: Master Total CPU time:          982.60 seconds     0.27 hours   1:1 
 +mdout.325289: Master Total CPU time:          611.08 seconds     0.17 hours   4:2 
 +mdout.326208: Master Total CPU time:          537.97 seconds     0.15 hours  36:4 
 + 
 +</code> 
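
As a rough worked comparison, the timings above can be expressed relative to the 1:1 run. This is only a sketch: it divides the reported "Master Total CPU time" values, which are CPU time on the master process rather than wall-clock time, so treat the ratios as indicative. ''speedup'' is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# speedup: hypothetical helper that prints baseline_seconds / run_seconds
# rounded to two decimals.
speedup() {
  awk -v base="$1" -v t="$2" 'BEGIN { printf "%.2f\n", base / t }'
}

# Seconds taken from the mdout lines above, each compared to the 1:1 run
speedup 982.60 611.08   # cpu:gpu 4:2
speedup 982.60 537.97   # cpu:gpu 36:4
```

Doubling the gpus (4:2) gives well under a 2x ratio, and going to all four gpus adds little beyond that, which is in line with the note that the gains are not worth the setup effort here.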
 ==== Miscellaneous ==== ==== Miscellaneous ====
  
Line 68: Line 141:
 #/usr/bin/nvidia-smi --gom=0 #/usr/bin/nvidia-smi --gom=0
  
-# for amber16 -pm=ENABLED -c=EXCLUSIVE_PROCESS+# for amber16 -pm=1/ENABLED -c=1/EXCLUSIVE_PROCESS
 #nvidia-smi --persistence-mode=1 #nvidia-smi --persistence-mode=1
 #nvidia-smi --compute-mode=1 #nvidia-smi --compute-mode=1
  
-# for mwgpu/exx96 -pm=ENABLED -c=DEFAULT +# for mwgpu/exx96 -pm=1/ENABLED -c=0/DEFAULT 
-nvidia-smi --persistence-mode=1 +# note: turned this off, running with defaults 
-nvidia-smi --compute-mode=0+# seems stable, maybe persistence later on 
 +# lets see how docker interacts first... 
 +#nvidia-smi --persistence-mode=1 
 +#nvidia-smi --compute-mode=0
  
 # turn ECC off (memory scrubbing) # turn ECC off (memory scrubbing)
Line 120: Line 196:
 systemctl restart network systemctl restart network
 dig google.com dig google.com
 +#centos7
 yum install -y iptables-services yum install -y iptables-services
 vi /etc/sysconfig/iptables vi /etc/sysconfig/iptables
Line 141: Line 218:
 # add packages and update # add packages and update
 yum install epel-release -y yum install epel-release -y
 +yum install flex flex-devel bison bison-devel -y 
 yum install tcl tcl-devel dmtcp -y yum install tcl tcl-devel dmtcp -y
 +yum install net-snmp net-snmp-libs net-snmp-agent-libs net-tools net-snmp-utils -y
 yum install freeglut-devel libXi-devel libXmu-devel make mesa-libGLU-devel -y yum install freeglut-devel libXi-devel libXmu-devel make mesa-libGLU-devel -y
 yum install blas blas-devel lapack lapack-devel boost boost-devel -y yum install blas blas-devel lapack lapack-devel boost boost-devel -y
Line 149: Line 228:
 yum install cmake cmake-devel -y yum install cmake cmake-devel -y
 yum install libjpeg libjpeg-devel libjpeg-turbo-devel -y yum install libjpeg libjpeg-devel libjpeg-turbo-devel -y
 +# amber
 +yum -y install tcsh make \
 +               gcc gcc-gfortran gcc-c++ \
 +               which flex bison patch bc \
 +               libXt-devel libXext-devel \
 +               perl perl-ExtUtils-MakeMaker util-linux wget \
 +               bzip2 bzip2-devel zlib-devel tar 
 yum update -y yum update -y
 yum clean all yum clean all
Line 220: Line 306:
 nvcr.io/nvidia/rapidsai/rapidsai   0.9-cuda10.0-runtime-centos7   22b5dc2f7e84        5 months ago        5.84GB nvcr.io/nvidia/rapidsai/rapidsai   0.9-cuda10.0-runtime-centos7   22b5dc2f7e84        5 months ago        5.84GB
  
-free -g+free -m
               total        used        free      shared  buff/cache   available               total        used        free      shared  buff/cache   available
-Mem:             92                    88           0           1          89+Mem:          95056        1919       85338          20        7798       92571 
 +Swap:         10239           0       10239 
  
 # nvidia-smi # nvidia-smi
cluster/192.1582741531.txt.gz · Last modified: 2020/02/26 13:25 by hmeij07