NVIDIA Firmware Tools (MFT) is a toolset for generating standard or customized NVIDIA firmware images and for querying firmware information. It is required by ibswinfo, which can monitor unmanaged InfiniBand switches. Our new InfiniBand switch is an SB7890 EDR. You will also need
But first, let's download MFT for Linux x86_64 as an rpm package and install it. Run install.sh with no arguments.
This is hairier than it looks. MFT writes to /dev as root and builds kernel modules, so make sure you have a backup. It also pulls in a suite of packages. So rather than do this on a storage server or head node, I will install it on a compute node connected to the switch.
yumdownloader --destdir=`pwd` \
annobin dwz efi-srpm-macros elfutils gc gcc-plugin-annobin \
gdb-headless ghc-srpm-macros go-srpm-macros guile libatomic_ops \
libbabeltrace libipt ocaml-srpm-macros openblas-srpm-macros \
patch perl-srpm-macros python-rpm-macros python-srpm-macros \
python3-rpm-macros qt5-srpm-macros redhat-rpm-config \
rust-srpm-macros zlib-devel zstd elfutils-libelf-devel \
python3-rpm-generators
rm -f *i686.rpm
# copy to n102 and install as root
======================================================================================================
 Package                 Architecture Version               Repository   Size
======================================================================================================
Installing:
 kernel-devel            x86_64       4.18.0-425.3.1.el8    baseos       22 M
 make                    x86_64       1:4.2.1-11.el8        baseos      497 k
 rpm-build               x86_64       4.14.3-24.el8_7       appstream   173 k
Installing dependencies:
 annobin                 x86_64       10.67-3.el8           appstream   954 k
 dwz                     x86_64       0.12-10.el8           appstream   108 k
 efi-srpm-macros         noarch       3-3.el8               appstream    21 k
 elfutils                x86_64       0.187-4.el8           baseos      542 k
 gc                      x86_64       7.6.4-3.el8           appstream   108 k
 gcc-plugin-annobin      x86_64       8.5.0-15.el8          appstream    34 k
 gdb-headless            x86_64       8.2-19.el8            appstream   3.7 M
 ghc-srpm-macros         noarch       1.4.2-7.el8           appstream   8.3 k
 go-srpm-macros          noarch       2-17.el8              appstream    12 k
 guile                   x86_64       5:2.0.14-7.el8        appstream   3.5 M
 libatomic_ops           x86_64       7.6.2-3.el8           appstream    37 k
 libbabeltrace           x86_64       1.5.4-4.el8           baseos      199 k
 libipt                  x86_64       1.6.1-8.el8           appstream    49 k
 ocaml-srpm-macros       noarch       5-4.el8               appstream   8.3 k
 openblas-srpm-macros    noarch       2-2.el8               appstream   6.9 k
 patch                   x86_64       2.7.6-11.el8          baseos      137 k
 perl-srpm-macros        noarch       1-25.el8              appstream   9.7 k
 python-rpm-macros       noarch       3-43.el8              appstream    15 k
 python-srpm-macros      noarch       3-43.el8              appstream    14 k
 python3-rpm-macros      noarch       3-43.el8              appstream    14 k
 python3-rpm-generators  noarch       5-7.el8               appstream    14 k
 qt5-srpm-macros         noarch       5.15.3-1.el8          appstream   9.5 k
 redhat-rpm-config       noarch       130-1.el8             appstream    89 k
 rust-srpm-macros        noarch       5-2.el8               appstream   8.2 k
 zlib-devel              x86_64       1.2.11-20.el8         baseos       57 k
 zstd                    x86_64       1.4.4-1.el8           appstream   392 k
Installing weak dependencies:
 elfutils-libelf-devel   x86_64       0.187-4.el8           baseos       60 k
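Since install.sh builds the MFT kernel module against the running kernel, it may be worth confirming that the installed kernel-devel matches the running kernel before going further. A minimal sketch, assuming `rpm -q kernel-devel` prints names of the form `kernel-devel-<release>` where `<release>` is what `uname -r` reports; the function name `kernel_devel_matches` is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: read a list of installed kernel-devel package names
# on stdin and succeed only if one of them matches the given kernel release.
kernel_devel_matches() {
    # -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF "kernel-devel-$1"
}

# Usage on the node (assumption: run as any user, rpm database readable):
#   rpm -q kernel-devel | kernel_devel_matches "$(uname -r)" \
#       && echo "kernel-devel matches running kernel" \
#       || echo "kernel-devel does NOT match running kernel"
```

If the check fails, rebooting into the newest installed kernel or installing the matching kernel-devel is probably the easier fix than fighting the module build.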
[root@n102 mft-4.23.0-104-x86_64-rpm]# ./install.sh
-I- Removing any old MFT file if exists...
-I- Building the MFT kernel binary RPM...
-I- Installing the MFT RPMs...
Verifying... ################################# [100%]
Preparing... ################################# [100%]
Updating / installing...
1:kernel-mft-4.23.0-4.18.0_425.3.1.################################# [100%]
Verifying... ################################# [100%]
Preparing... ################################# [100%]
Updating / installing...
1:mft-4.23.0-104 ################################# [100%]
-I- In order to start mst, please run "mst start".
[root@n102 ~]# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Loading MST PCI configuration module - Success
Create devices
Unloading MST PCI module (unused) - Success
[root@n102 ~]# mst ib add
-I- Discovering the fabric - Running: ibnetdiscover
-I- Added 8 in-band devices
[root@n102 ~]# mst status
MST modules:
------------
MST PCI module is not loaded
MST PCI configuration module loaded
MST devices:
------------
/dev/mst/mt4119_pciconf0 - PCI configuration cycles access.
domain:bus:dev.fn=0000:31:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
Chip revision is: 00
Inband devices:
-------------------
/dev/mst/CA_MT4119_astrostore_mlx5_0_lid-0x0001
/dev/mst/CA_MT4119_n102_mlx5_0_lid-0x0005 <--- node names
/dev/mst/CA_MT4119_n103_mlx5_0_lid-0x0007
/dev/mst/CA_MT4119_n104_mlx5_0_lid-0x0004
/dev/mst/CA_MT4119_n105_mlx5_0_lid-0x0002
/dev/mst/CA_MT4119_n106_mlx5_0_lid-0x0006
/dev/mst/CA_MT4119_n107_mlx5_0_lid-0x0008
/dev/mst/SW_MT53000_SwitchIB_lid-0x0003
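The switch shows up as the in-band device whose name starts with SW_. To avoid copying that path by hand, something like the following could pull it out of the `mst status` output. This is only a sketch; the function name `mst_switch_devices` is made up, and it assumes switch devices always appear as `/dev/mst/SW_...` as they do in the listing above:

```shell
#!/bin/sh
# Hypothetical helper: read `mst status` output on stdin and print only the
# in-band switch device paths (first field starting with /dev/mst/SW_).
mst_switch_devices() {
    awk '$1 ~ /^\/dev\/mst\/SW_/ { print $1 }'
}

# Usage on the node:
#   mst status | mst_switch_devices
```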
OK, on to ibswinfo.sh from https://github.com/stanford-rc/ibswinfo
Download the script and stage it in /usr/bin
Probe…
[root@n102 ~]# ibswinfo.sh -d /dev/mst/SW_MT53000_SwitchIB_lid-0x0003
=================================================
SwitchIB Mellanox Technologies
=================================================
part number | MSB7890-ES2F
serial number | MT2239XZ011W
product name | Scorpion2 IB EDR Unmanaged
revision | AC
ports | 36
PSID | MT_2640110032
GUID | 0x900a840300ecde60
firmware version | 15.2008.2102
-------------------------------------------------
uptime (d-h:m:s) | 33d-21:11:22
-------------------------------------------------
PSU0 status | OK
P/N | MTEF-PSF-AC-I
S/N | MT2238XZ0MYR
DC power | ERROR
fan status | ERROR
PSU1 status | OK
P/N | MTEF-PSF-AC-I
S/N | MT2238XZ0MZ2
DC power | OK
fan status | OK
power (W) | 50
-------------------------------------------------
temperature (C) | 38
max temp (C) | 45
-------------------------------------------------
fan status | ERROR
fan#1 (rpm) | 8441
fan#2 (rpm) | 7156
fan#3 (rpm) | 8389
fan#4 (rpm) | 7270
fan#5 (rpm) | 8337
fan#6 (rpm) | 7194
fan#7 (rpm) | 8389
fan#8 (rpm) | 7156
-------------------------------------------------
Looks like I don't have both power units plugged in. Will have to check next time I'm in.
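For unattended monitoring, the ERROR fields in that report could be picked out with a small filter. A rough sketch, assuming the `key | value` layout shown above with dashed separator lines between sections; the function name `ibswinfo_errors` is made up, and the PSU prefix is only there to tell the repeated "fan status" labels apart:

```shell
#!/bin/sh
# Hypothetical helper: read an ibswinfo report on stdin and print the name
# of every field whose value is ERROR, prefixed with the PSU it belongs to.
ibswinfo_errors() {
    awk -F'|' '
        /^-+$/ { section = ""; next }   # dashed line: leave the PSU block
        NF < 2 { next }                 # skip banner lines without a "|"
        {
            key = $1; val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim both ends
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            # "PSU0 status" starts a PSU block; remember "PSU0"
            if (key ~ /^PSU[0-9]+ status$/)
                section = substr(key, 1, index(key, " ") - 1)
            if (val ~ /^ERROR/)
                print (section == "" ? key : section " " key)
        }'
}

# Usage on the node:
#   ibswinfo.sh -d /dev/mst/SW_MT53000_SwitchIB_lid-0x0003 | ibswinfo_errors
```

Run against the report above, this would list the PSU0 DC power and fan status fields plus the chassis-level fan status.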
Other useful commands…
[root@n102 ~]# ibnodes
Ca : 0xb83fd2030063fc88 ports 1 "n107 mlx5_0"
Ca : 0xb83fd2030063f8a4 ports 1 "n106 mlx5_0"
Ca : 0xb83fd2030063fb5c ports 1 "n105 mlx5_0"
Ca : 0xb83fd2030063f88c ports 1 "n104 mlx5_0"
Ca : 0xb83fd2030063faa4 ports 1 "astrostore mlx5_0"
Ca : 0xb83fd2030063fac8 ports 1 "n103 mlx5_0"
Ca : 0xb83fd2030063fca0 ports 1 "n102 mlx5_0"
Switch : 0x900a840300ecde60 ports 37 "SwitchIB Mellanox Technologies" base port 0 lid 3 lmc 0
[root@n102 ~]# ibstatus
Infiniband device 'mlx5_0' port 1 status:
default gid: fe80:0000:0000:0000:b83f:d203:0063:fca0
base lid: 0x5
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 100 Gb/sec (4X EDR)
link_layer: InfiniBand
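If you only want the link state and rate out of ibstatus, for a quick health check across nodes, the output can be boiled down to one line. A sketch only, assuming a single-port HCA as on these nodes (with several ports only the last one would be reported); the function name `ib_port_summary` is made up:

```shell
#!/bin/sh
# Hypothetical helper: read `ibstatus` output on stdin and print a one-line
# summary of logical state, physical state and rate.
ib_port_summary() {
    awk -F':[ \t]*' '
        /^[ \t]*state:/      { state = $NF }   # "state: 4: ACTIVE" -> ACTIVE
        /^[ \t]*phys state:/ { phys  = $NF }   # "phys state: 5: LinkUp" -> LinkUp
        /^[ \t]*rate:/       { rate  = $2 }    # "rate: 100 Gb/sec (4X EDR)"
        END { printf "state=%s phys=%s rate=%s\n", state, phys, rate }'
}

# Usage on the node:
#   ibstatus | ib_port_summary
```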
Under load with full power…
[root@n102 ~]# ibswinfo.sh -d /dev/mst/SW_MT53000_SwitchIB_lid-0x0003
=================================================
SwitchIB Mellanox Technologies
=================================================
part number | MSB7890-ES2F
serial number | MT2239XZ011W
product name | Scorpion2 IB EDR Unmanaged
revision | AC
ports | 36
PSID | MT_2640110032
GUID | 0x900a840300ecde60
firmware version | 15.2008.2102
-------------------------------------------------
uptime (d-h:m:s) | 46d-21:14:46
-------------------------------------------------
PSU0 status | OK
P/N | MTEF-PSF-AC-I
S/N | MT2238XZ0MYR
DC power | OK
fan status | OK
power (W) | 27 <--- 27+32=59 W total; unit is rated 122 typical, 162 max
PSU1 status | OK
P/N | MTEF-PSF-AC-I
S/N | MT2238XZ0MZ2
DC power | OK
fan status | OK
power (W) | 32
-------------------------------------------------
temperature (C) | 39 <--- one degree higher
max temp (C) | 45
-------------------------------------------------
fan status | OK <--- speeds about the same
fan#1 (rpm) | 8337
fan#2 (rpm) | 7194
fan#3 (rpm) | 8287
fan#4 (rpm) | 7045
fan#5 (rpm) | 8389
fan#6 (rpm) | 7194
fan#7 (rpm) | 8441
fan#8 (rpm) | 7156
-------------------------------------------------
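The total draw is just the sum of the per-PSU "power (W)" lines, so it can be computed straight from the report instead of by hand. A small sketch, assuming the `key | value` layout above and that anything after the number on a line (such as my arrows) can be ignored; the function name `psu_power_total` is made up:

```shell
#!/bin/sh
# Hypothetical helper: read an ibswinfo report on stdin and print the sum of
# all "power (W)" values in watts.
psu_power_total() {
    awk -F'|' '
        $1 ~ /power \(W\)/ {
            split($2, f, " ")   # first word of the value is the number
            total += f[1]
        }
        END { print total + 0 } # + 0 prints 0 when no PSU reports power
    '
}

# Usage on the node:
#   ibswinfo.sh -d /dev/mst/SW_MT53000_SwitchIB_lid-0x0003 | psu_power_total
```

Note that "DC power" lines are not counted, since only the label "power (W)" is matched.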