cluster:192

Differences

This shows you the differences between two versions of the page.

cluster:192 [2020/02/26 20:05]
hmeij07 [EXX96]
cluster:192 [2022/03/08 18:29]
hmeij07 [Recipe]
Line 4: Line 4:
 ===== EXX96 ===== ===== EXX96 =====
  
-A page for me on how these 12 nodes were built up after they arrived, to make them "ala n37", which was the test node in redoing our K20 nodes; see [[cluster:172|K20 Redo]]+A page for me on how these 12 nodes were built up after they arrived, to make them "ala n37", which was the test node in redoing our K20 nodes; see [[cluster:172|K20 Redo]] and [[cluster:173|K20 Redo Usage]]
  
-Page best followed bottom to top.+Page best followed bottom to top if interested in the whole process.
  
 +The Usage section below is for HPCC users wanting to use queue ''exx96''.
 +
 +Debug for node n89 which turns itself off...grrhhh. Create a usb bootable stick with https://rufus.ie/ then unzip BIOS and firmware zip files located in ''n89:/usr/local/src''
 +
 +<code>
 +
 +[root@n89 ~]# ipmitool sel elist
 +   1 | 02/29/2020 | 16:57:33 | Memory #0xd1 | Uncorrectable ECC | Asserted
 +   2 | 03/02/2020 | 03:02:42 | Processor CPU_CATERR | IERR | Asserted
 +   3 | 03/11/2020 | 19:27:35 | Processor CPU_CATERR | IERR | Asserted
 +...[snip]...
 +
 +[root@n89 ~]# ipmitool sdr elist
 +CPU1 Temperature | 31h | ok  |  3.0 | 43 degrees C
 +CPU2 Temperature | 32h | ok  |  0.0 | 40 degrees C
 +PSU1 Over Temp   | 92h | ok  |  0.0 | Transition to OK
 +PSU2 Over Temp   | 9Ah | ok  |  0.0 | Transition to OK
 +...[snip]...
 +DIMMM1_Temp      | E4h | ok  |  3.0 | 28 degrees C
 +CPU1_ECC1        | D1h | ok  |  0.0 | Presence Detected
 +CPU2_ECC1        | D3h | ok  |  0.0 | Presence Detected
 +...[snip]...
 +PMBPower1        | E1h | ok  |  3.0 | 88 Watts
 +PMBPower2        | E2h | ok  |  3.0 | 112 Watts
 +...[snip]...
 +FRNT_FAN1        | A2h | ok  |  0.0 | 3100 RPM
 +...[snip]...
 +PSU1 Slow FAN1   | 95h | ok  |  0.0 | Transition to OK
 +PSU2 Slow FAN1   | 9Dh | ok  |  0.0 | Transition to OK
 +...[snip]...
 +
 +
 +[root@n89 ~]# dmidecode -t0
 +# dmidecode 3.2
 +Getting SMBIOS data from sysfs.
 +SMBIOS 3.2 present.
 +
 +Handle 0x0000, DMI type 0, 26 bytes
 +BIOS Information
 +        Vendor: American Megatrends Inc.
 +        Version: 5102
 +        Release Date: 02/11/2019
 +        Address: 0xF0000
 +        Runtime Size: 64 kB
 +        ROM Size: 32 MB
 +        Characteristics:
 +...[snip]...
 +                UEFI is supported
 +        BIOS Revision: 5.14
 +
 +
 +[root@n89 ~]# edac-util -s -v
 +edac-util: EDAC drivers are loaded. 4 MCs detected:
 +  mc0:Skylake Socket#0 IMC#0
 +  mc1:Skylake Socket#0 IMC#1
 +  mc2:Skylake Socket#1 IMC#0
 +  mc3:Skylake Socket#1 IMC#1
 +[root@n89 ~]# edac-util
 +edac-util: No errors to report.
 +
 +syslog
 +
 +</code>
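When a node keeps logging events like the ones above, it can help to tally the system event log by sensor and event type rather than reading it line by line. The following is a minimal sketch, assuming the `ipmitool sel elist` layout shown in the listing (fields separated by `|`, with the assertion state in the sixth field). The function name `sel_summary` is hypothetical and not part of ipmitool.

```shell
#!/bin/sh
# Hypothetical helper: tally asserted SEL events by "sensor / event".
# Reads `ipmitool sel elist` style lines on stdin; lines without six
# pipe-separated fields (such as "...[snip]...") are ignored.
sel_summary() {
    awk -F' *[|] *' '
        NF >= 6 && $6 ~ /^Asserted/ { count[$4 " / " $5]++ }
        END { for (k in count) printf "%d %s\n", count[k], k }
    ' | LC_ALL=C sort -k2
}
```

Typical use on a node would be ''ipmitool sel elist | sel_summary'', which prints one line per distinct sensor and event pair, preceded by its count.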
 ==== Usage ==== ==== Usage ====
  
Line 16: Line 79:
 The wrappers provided assume your cpu:gpu ratio is 1:1, hence your submit script will have ''#BSUB -n 1'' and your resource allocation line ''gpu4=1''. If your ratio is something else you can set CPU_GPU_REQUEST. For example, CPU_GPU_REQUEST=4:2 expects the lines ''#BSUB -n 4'' and ''gpu4=2'' in your submit script. A sample script is at ''/home/hmeij/k20redo/run.rtx''. The wrappers provided assume your cpu:gpu ratio is 1:1, hence your submit script will have ''#BSUB -n 1'' and your resource allocation line ''gpu4=1''. If your ratio is something else you can set CPU_GPU_REQUEST. For example, CPU_GPU_REQUEST=4:2 expects the lines ''#BSUB -n 4'' and ''gpu4=2'' in your submit script. A sample script is at ''/home/hmeij/k20redo/run.rtx''.
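To make the cpu:gpu ratio rule concrete, here is a small sketch that turns a CPU_GPU_REQUEST value into the two submit-script settings it expects. It only illustrates the rule described above. How the wrappers themselves parse the variable is not shown on this page, and the function name `expected_bsub_lines` is hypothetical.

```shell
#!/bin/sh
# Sketch only: the real wrappers may parse CPU_GPU_REQUEST differently.
# Prints the two submit-script settings that a cpu:gpu ratio expects.
expected_bsub_lines() {
    local request="$1"
    [ -n "$request" ] || request="1:1"   # default ratio is 1:1

    # Accept exactly "<digits>:<digits>", reject anything else.
    case "$request" in
        *[!0-9:]* | *:*:* | :* | *:)
            echo "bad CPU_GPU_REQUEST: $request" >&2; return 1 ;;
        *:*) ;;
        *)  echo "bad CPU_GPU_REQUEST: $request" >&2; return 1 ;;
    esac

    local cpus="${request%%:*}"          # text before the colon
    local gpus="${request##*:}"          # text after the colon
    printf '#BSUB -n %s\n' "$cpus"
    printf 'gpu4=%s\n' "$gpus"
}
```

Calling ''expected_bsub_lines 4:2'' prints ''#BSUB -n 4'' and ''gpu4=2'', which can be compared against a submit script before it is sent to the queue.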
  
-The wrappers (78.mpich3.wrapper for ''n78'', and n37.openmpi.wrapper for all others) are located in ''/usr/local/bin'' and will set up your environment and start any of these applications: amber, lammps, gromacs, matlab and namd from ''/usr/local''.+The wrappers (n78.mpich3.wrapper for ''n78'', and n37.openmpi.wrapper for all others) are located in ''/usr/local/bin'' and will set up your environment and start any of these applications: amber, lammps, gromacs, matlab and namd from ''/usr/local''.
    
  
Line 78: Line 141:
 #/usr/bin/nvidia-smi --gom=0 #/usr/bin/nvidia-smi --gom=0
  
-# for amber16 -pm=ENABLED -c=EXCLUSIVE_PROCESS+# for amber16 -pm=1/ENABLED -c=1/EXCLUSIVE_PROCESS
 #nvidia-smi --persistence-mode=1 #nvidia-smi --persistence-mode=1
 #nvidia-smi --compute-mode=1 #nvidia-smi --compute-mode=1
  
-# for mwgpu/exx96 -pm=ENABLED -c=DEFAULT +# for mwgpu/exx96 -pm=1/ENABLED -c=0/DEFAULT 
-nvidia-smi --persistence-mode=1 +# note: turned this off, running with defaults 
-nvidia-smi --compute-mode=0+# seems stable, maybe persistence later on 
 +# lets see how docker interacts first... 
 +#nvidia-smi --persistence-mode=1 
 +#nvidia-smi --compute-mode=0
  
 # turn ECC off (memory scrubbing) # turn ECC off (memory scrubbing)
Line 130: Line 196:
 systemctl restart network systemctl restart network
 dig google.com dig google.com
 +#centos7
 yum install -y iptables-services yum install -y iptables-services
 vi /etc/sysconfig/iptables vi /etc/sysconfig/iptables
Line 151: Line 218:
 # add packages and update # add packages and update
 yum install epel-release -y yum install epel-release -y
 +yum install flex flex-devel bison bison-devel -y 
 yum install tcl tcl-devel dmtcp -y yum install tcl tcl-devel dmtcp -y
 +yum install net-snmp net-snmp-libs net-snmp-agent-libs net-tools net-snmp-utils -y
 yum install freeglut-devel libXi-devel libXmu-devel make mesa-libGLU-devel -y yum install freeglut-devel libXi-devel libXmu-devel make mesa-libGLU-devel -y
 yum install blas blas-devel lapack lapack-devel boost boost-devel -y yum install blas blas-devel lapack lapack-devel boost boost-devel -y
Line 159: Line 228:
 yum install cmake cmake-devel -y yum install cmake cmake-devel -y
 yum install libjpeg libjpeg-devel libjpeg-turbo-devel -y yum install libjpeg libjpeg-devel libjpeg-turbo-devel -y
 +# amber
 +yum -y install tcsh make \
 +               gcc gcc-gfortran gcc-c++ \
 +               which flex bison patch bc \
 +               libXt-devel libXext-devel \
 +               perl perl-ExtUtils-MakeMaker util-linux wget \
 +               bzip2 bzip2-devel zlib-devel tar 
 yum update -y yum update -y
 yum clean all yum clean all
Line 230: Line 306:
 nvcr.io/nvidia/rapidsai/rapidsai   0.9-cuda10.0-runtime-centos7   22b5dc2f7e84        5 months ago        5.84GB nvcr.io/nvidia/rapidsai/rapidsai   0.9-cuda10.0-runtime-centos7   22b5dc2f7e84        5 months ago        5.84GB
  
-free -g+free -m
               total        used        free      shared  buff/cache   available               total        used        free      shared  buff/cache   available
-Mem:             92                    88           0           1          89+Mem:          95056        1919       85338          20        7798       92571 
 +Swap:         10239           0       10239 
  
 # nvidia-smi # nvidia-smi
cluster/192.txt · Last modified: 2022/03/08 18:29 by hmeij07