\\
**[[cluster:0|Back]]**
==== Infiniband Monitoring ====
The NVIDIA Firmware Tools (MFT) is a toolset for generating standard or customized NVIDIA firmware images and for querying firmware information. It is required by ''ibswinfo'', which can monitor unmanaged Infiniband switches. Our new Infiniband switch is an **SB7890** EDR. You will also need
* infiniband-diags
* awk, sed, coreutils
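To confirm the prerequisites are present on a node, checking the PATH is usually enough. This is a minimal sketch: ''missing_tools'' is a hypothetical helper name, and which commands to pass it is up to you (for example ''ibnetdiscover'', which ships with infiniband-diags).

```shell
#!/bin/sh
# Print the name of every command given as an argument that is not on PATH.
# Returns 0 if all commands were found, 1 if at least one is missing.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return $status
}
```

For example, ''missing_tools ibnetdiscover awk sed'' prints nothing when everything is installed.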
But first let's download MFT for Linux x86 as an rpm package and install it. Run ''install.sh'' with no arguments.
* https://network.nvidia.com/products/adapter-software/firmware-tools/
This is hairier than it looks. MFT will write as root to ''/dev'' and build kernel modules, so make sure you have a backup. It will also pull down a suite of packages. So rather than doing this on a storage server or head node, I will install it on a compute node connected to the switch.
yumdownloader --destdir=`pwd` \
annobin dwz efi-srpm-macros elfutils gc gcc-plugin-annobin \
gdb-headless ghc-srpm-macros go-srpm-macros guile libatomic_ops \
libbabeltrace libipt ocaml-srpm-macros openblas-srpm-macros \
patch perl-srpm-macros python-rpm-macros python-srpm-macros \
python3-rpm-macros qt5-srpm-macros redhat-rpm-config \
rust-srpm-macros zlib-devel zstd elfutils-libelf-devel \
python3-rpm-generators-5-7.el8.noarch.rpm
rm -f *i686.rpm
# copy to n102 and install as root
======================================================================================================
Package Architecture Version Repository Size
======================================================================================================
Installing:
kernel-devel x86_64 4.18.0-425.3.1.el8 baseos 22 M
make x86_64 1:4.2.1-11.el8 baseos 497 k
rpm-build x86_64 4.14.3-24.el8_7 appstream 173 k
Installing dependencies:
annobin x86_64 10.67-3.el8 appstream 954 k
dwz x86_64 0.12-10.el8 appstream 108 k
efi-srpm-macros noarch 3-3.el8 appstream 21 k
elfutils x86_64 0.187-4.el8 baseos 542 k
gc x86_64 7.6.4-3.el8 appstream 108 k
gcc-plugin-annobin x86_64 8.5.0-15.el8 appstream 34 k
gdb-headless x86_64 8.2-19.el8 appstream 3.7 M
ghc-srpm-macros noarch 1.4.2-7.el8 appstream 8.3 k
go-srpm-macros noarch 2-17.el8 appstream 12 k
guile x86_64 5:2.0.14-7.el8 appstream 3.5 M
libatomic_ops x86_64 7.6.2-3.el8 appstream 37 k
libbabeltrace x86_64 1.5.4-4.el8 baseos 199 k
libipt x86_64 1.6.1-8.el8 appstream 49 k
ocaml-srpm-macros noarch 5-4.el8 appstream 8.3 k
openblas-srpm-macros noarch 2-2.el8 appstream 6.9 k
patch x86_64 2.7.6-11.el8 baseos 137 k
perl-srpm-macros noarch 1-25.el8 appstream 9.7 k
python-rpm-macros noarch 3-43.el8 appstream 15 k
python-srpm-macros noarch 3-43.el8 appstream 14 k
python3-rpm-macros noarch 3-43.el8 appstream 14 k
python3-rpm-generators noarch 5-7.el8 appstream 14k
qt5-srpm-macros noarch 5.15.3-1.el8 appstream 9.5 k
redhat-rpm-config noarch 130-1.el8 appstream 89 k
rust-srpm-macros noarch 5-2.el8 appstream 8.2 k
zlib-devel x86_64 1.2.11-20.el8 baseos 57 k
zstd x86_64 1.4.4-1.el8 appstream 392 k
Installing weak dependencies:
elfutils-libelf-devel x86_64 0.187-4.el8 baseos 60 k
[root@n102 mft-4.23.0-104-x86_64-rpm]# ./install.sh
-I- Removing any old MFT file if exists...
-I- Building the MFT kernel binary RPM...
-I- Installing the MFT RPMs...
Verifying... ################################# [100%]
Preparing... ################################# [100%]
Updating / installing...
1:kernel-mft-4.23.0-4.18.0_425.3.1.################################# [100%]
Verifying... ################################# [100%]
Preparing... ################################# [100%]
Updating / installing...
1:mft-4.23.0-104 ################################# [100%]
-I- In order to start mst, please run "mst start".
[root@n102 ~]# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Loading MST PCI configuration module - Success
Create devices
Unloading MST PCI module (unused) - Success
[root@n102 ~]# mst ib add
-I- Discovering the fabric - Running: ibnetdiscover
-I- Added 8 in-band devices
[root@n102 ~]# mst status
MST modules:
------------
MST PCI module is not loaded
MST PCI configuration module loaded
MST devices:
------------
/dev/mst/mt4119_pciconf0 - PCI configuration cycles access.
domain:bus:dev.fn=0000:31:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
Chip revision is: 00
Inband devices:
-------------------
/dev/mst/CA_MT4119_astrostore_mlx5_0_lid-0x0001
/dev/mst/CA_MT4119_n102_mlx5_0_lid-0x0005 <--- node names
/dev/mst/CA_MT4119_n103_mlx5_0_lid-0x0007
/dev/mst/CA_MT4119_n104_mlx5_0_lid-0x0004
/dev/mst/CA_MT4119_n105_mlx5_0_lid-0x0002
/dev/mst/CA_MT4119_n106_mlx5_0_lid-0x0006
/dev/mst/CA_MT4119_n107_mlx5_0_lid-0x0008
/dev/mst/SW_MT53000_SwitchIB_lid-0x0003
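The switch device path is needed for the probe, so it can be handy to pull it out of the ''mst status'' listing instead of copying it by hand. A minimal sketch, assuming switches show up under ''/dev/mst/'' with an ''SW_'' prefix as in the listing above; ''switch_devices'' is a hypothetical helper name.

```shell
#!/bin/sh
# Read `mst status` output on stdin and print the in-band switch device paths,
# i.e. entries under /dev/mst/ whose name starts with SW_ (assumed naming,
# taken from the listing above).
switch_devices() {
    awk '$1 ~ /^\/dev\/mst\/SW_/ { print $1 }'
}
```

Usage would be along the lines of ''mst status | switch_devices''.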
OK, onward to staging ''ibswinfo.sh'' from https://github.com/stanford-rc/ibswinfo
Download the script and stage it in ''/usr/bin''.
Probe...
[root@n102 ~]# ibswinfo.sh -d /dev/mst/SW_MT53000_SwitchIB_lid-0x0003
=================================================
SwitchIB Mellanox Technologies
=================================================
part number | MSB7890-ES2F
serial number | MT2239XZ011W
product name | Scorpion2 IB EDR Unmanaged
revision | AC
ports | 36
PSID | MT_2640110032
GUID | 0x900a840300ecde60
firmware version | 15.2008.2102
-------------------------------------------------
uptime (d-h:m:s) | 33d-21:11:22
-------------------------------------------------
PSU0 status | OK
P/N | MTEF-PSF-AC-I
S/N | MT2238XZ0MYR
DC power | ERROR
fan status | ERROR
PSU1 status | OK
P/N | MTEF-PSF-AC-I
S/N | MT2238XZ0MZ2
DC power | OK
fan status | OK
power (W) | 50
-------------------------------------------------
temperature (C) | 38
max temp (C) | 45
-------------------------------------------------
fan status | ERROR
fan#1 (rpm) | 8441
fan#2 (rpm) | 7156
fan#3 (rpm) | 8389
fan#4 (rpm) | 7270
fan#5 (rpm) | 8337
fan#6 (rpm) | 7194
fan#7 (rpm) | 8389
fan#8 (rpm) | 7156
-------------------------------------------------
Looks like I don't have both power units plugged in. Will have to check next time I'm in.
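For unattended monitoring, the ERROR fields can be picked out of the report automatically. This is a minimal sketch, assuming the ''field | value'' layout and dashed separators shown above; ''report_errors'' is a hypothetical helper name, not part of ibswinfo.

```shell
#!/bin/sh
# Read an ibswinfo report on stdin and print every field whose value is ERROR.
# Fields that follow a "PSUn status" line are attributed to that PSU until the
# next dashed separator; anything else is attributed to the switch itself.
# Returns 1 if any ERROR was found, 0 otherwise.
report_errors() {
    awk -F'|' '
        {
            key = $1; val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim the field name
            gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim the value
        }
        /^[ \t]*-+[ \t]*$/ { psu = "" }        # separator ends a PSU section
        key ~ /^PSU[0-9]+ status$/ { psu = substr(key, 1, index(key, " ") - 1) }
        val == "ERROR" {
            where = (psu == "") ? "switch" : psu
            print where ": " key
            found = 1
        }
        END { exit (found ? 1 : 0) }
    '
}
```

Piping the probe above through ''report_errors'' should list the PSU0 DC power and fan faults plus the overall fan status, and the non-zero exit status can drive a cron mail or similar alert.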
Other useful commands...
[root@n102 ~]# ibnodes
Ca : 0xb83fd2030063fc88 ports 1 "n107 mlx5_0"
Ca : 0xb83fd2030063f8a4 ports 1 "n106 mlx5_0"
Ca : 0xb83fd2030063fb5c ports 1 "n105 mlx5_0"
Ca : 0xb83fd2030063f88c ports 1 "n104 mlx5_0"
Ca : 0xb83fd2030063faa4 ports 1 "astrostore mlx5_0"
Ca : 0xb83fd2030063fac8 ports 1 "n103 mlx5_0"
Ca : 0xb83fd2030063fca0 ports 1 "n102 mlx5_0"
Switch : 0x900a840300ecde60 ports 37 "SwitchIB Mellanox Technologies" base port 0 lid 3 lmc 0
[root@n102 ~]# ibstatus
Infiniband device 'mlx5_0' port 1 status:
default gid: fe80:0000:0000:0000:b83f:d203:0063:fca0
base lid: 0x5
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 100 Gb/sec (4X EDR)
link_layer: InfiniBand
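The same idea works for the host side: flag any port whose logical state is not ACTIVE. A minimal sketch, assuming the ''ibstatus'' layout shown above; ''inactive_ports'' is a hypothetical helper name.

```shell
#!/bin/sh
# Read `ibstatus` output on stdin and print each device/port whose logical
# state is not ACTIVE. The "phys state:" line is ignored because its first
# field is "phys", not "state:".
# Returns 1 if any such port was found, 0 otherwise.
inactive_ports() {
    awk -v q="'" '
        /^Infiniband device/ {
            dev = $3
            gsub(q, "", dev)      # strip the quotes around the device name
            port = $5
        }
        $1 == "state:" && $NF != "ACTIVE" {
            print dev " port " port ": " $NF
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

Run as ''ibstatus | inactive_ports''; on a healthy node it prints nothing.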
Under load with full power...
[root@n102 ~]# ibswinfo.sh -d /dev/mst/SW_MT53000_SwitchIB_lid-0x0003
=================================================
SwitchIB Mellanox Technologies
=================================================
part number | MSB7890-ES2F
serial number | MT2239XZ011W
product name | Scorpion2 IB EDR Unmanaged
revision | AC
ports | 36
PSID | MT_2640110032
GUID | 0x900a840300ecde60
firmware version | 15.2008.2102
-------------------------------------------------
uptime (d-h:m:s) | 46d-21:14:46
-------------------------------------------------
PSU0 status | OK
P/N | MTEF-PSF-AC-I
S/N | MT2238XZ0MYR
DC power | OK
fan status | OK
power (W) | 27 <--- 27+32=59 units rated typical 122, max 162
PSU1 status | OK
P/N | MTEF-PSF-AC-I
S/N | MT2238XZ0MZ2
DC power | OK
fan status | OK
power (W) | 32
-------------------------------------------------
temperature (C) | 39 <--- one degree higher
max temp (C) | 45
-------------------------------------------------
fan status | OK <--- speeds about the same
fan#1 (rpm) | 8337
fan#2 (rpm) | 7194
fan#3 (rpm) | 8287
fan#4 (rpm) | 7045
fan#5 (rpm) | 8389
fan#6 (rpm) | 7194
fan#7 (rpm) | 8441
fan#8 (rpm) | 7156
-------------------------------------------------
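The total draw in the annotation above (27 + 32 = 59) was added up by hand. A minimal sketch of doing the sum from the report itself, assuming each PSU reports a ''power (W)'' line as shown; ''total_power'' is a hypothetical helper name.

```shell
#!/bin/sh
# Read an ibswinfo report on stdin and print the sum of all "power (W)" values.
# "DC power" lines are not counted because they lack the "(W)" suffix.
total_power() {
    awk -F'|' '$1 ~ /power \(W\)/ { sum += $2 } END { print sum + 0 }'
}
```

With both supplies plugged in, as in the report above, this prints 59.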
\\
**[[cluster:0|Back]]**