The NVIDIA Firmware Tools (MFT) is a toolset for generating a standard or customized NVIDIA firmware image and for querying firmware information. It is required for ibswinfo,
which can monitor unmanaged InfiniBand switches. Our new InfiniBand switch is an SB7890 EDR. You will also need
But first, let's download MFT for Linux x86 as an RPM package and install it. Use install.sh
with no arguments.
This is hairier than it looks. MFT writes to /dev as root
and builds kernel modules, so make sure you have a backup. It also pulls in a suite of packages. So rather than doing this on a storage server or head node, I will install it on a compute node connected to the switch.
yumdownloader --destdir=`pwd` \
  annobin dwz efi-srpm-macros elfutils gc gcc-plugin-annobin \
  gdb-headless ghc-srpm-macros go-srpm-macros guile libatomic_ops \
  libbabeltrace libipt ocaml-srpm-macros openblas-srpm-macros \
  patch perl-srpm-macros python-rpm-macros python-srpm-macros \
  python3-rpm-macros qt5-srpm-macros redhat-rpm-config \
  rust-srpm-macros zlib-devel zstd elfutils-libelf-devel \
  python3-rpm-generators-5-7.el8.noarch.rpm
rm -f *i686.rpm

# copy to n102 and install as root

======================================================================================================
 Package                   Architecture   Version                 Repository     Size
======================================================================================================
Installing:
 kernel-devel              x86_64         4.18.0-425.3.1.el8      baseos          22 M
 make                      x86_64         1:4.2.1-11.el8          baseos         497 k
 rpm-build                 x86_64         4.14.3-24.el8_7         appstream      173 k
Installing dependencies:
 annobin                   x86_64         10.67-3.el8             appstream      954 k
 dwz                       x86_64         0.12-10.el8             appstream      108 k
 efi-srpm-macros           noarch         3-3.el8                 appstream       21 k
 elfutils                  x86_64         0.187-4.el8             baseos         542 k
 gc                        x86_64         7.6.4-3.el8             appstream      108 k
 gcc-plugin-annobin        x86_64         8.5.0-15.el8            appstream       34 k
 gdb-headless              x86_64         8.2-19.el8              appstream      3.7 M
 ghc-srpm-macros           noarch         1.4.2-7.el8             appstream      8.3 k
 go-srpm-macros            noarch         2-17.el8                appstream       12 k
 guile                     x86_64         5:2.0.14-7.el8          appstream      3.5 M
 libatomic_ops             x86_64         7.6.2-3.el8             appstream       37 k
 libbabeltrace             x86_64         1.5.4-4.el8             baseos         199 k
 libipt                    x86_64         1.6.1-8.el8             appstream       49 k
 ocaml-srpm-macros         noarch         5-4.el8                 appstream      8.3 k
 openblas-srpm-macros      noarch         2-2.el8                 appstream      6.9 k
 patch                     x86_64         2.7.6-11.el8            baseos         137 k
 perl-srpm-macros          noarch         1-25.el8                appstream      9.7 k
 python-rpm-macros         noarch         3-43.el8                appstream       15 k
 python-srpm-macros        noarch         3-43.el8                appstream       14 k
 python3-rpm-macros        noarch         3-43.el8                appstream       14 k
 python3-rpm-generators    noarch         5-7.el8                 appstream       14 k
 qt5-srpm-macros           noarch         5.15.3-1.el8            appstream      9.5 k
 redhat-rpm-config         noarch         130-1.el8               appstream       89 k
 rust-srpm-macros          noarch         5-2.el8                 appstream      8.2 k
 zlib-devel                x86_64         1.2.11-20.el8           baseos          57 k
 zstd                      x86_64         1.4.4-1.el8             appstream      392 k
Installing weak dependencies:
 elfutils-libelf-devel     x86_64         0.187-4.el8             baseos          60 k

[root@n102 mft-4.23.0-104-x86_64-rpm]# ./install.sh
-I- Removing any old MFT file if exists...
-I- Building the MFT kernel binary RPM...
-I- Installing the MFT RPMs...
Verifying...                          ################################# [100%]
Preparing...                          ################################# [100%]
Updating / installing...
   1:kernel-mft-4.23.0-4.18.0_425.3.1.################################# [100%]
Verifying...                          ################################# [100%]
Preparing...                          ################################# [100%]
Updating / installing...
   1:mft-4.23.0-104                   ################################# [100%]
-I- In order to start mst, please run "mst start".

[root@n102 ~]# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Loading MST PCI configuration module - Success
Create devices
Unloading MST PCI module (unused) - Success

[root@n102 ~]# mst ib add
-I- Discovering the fabric - Running: ibnetdiscover
-I- Added 8 in-band devices

[root@n102 ~]# mst status
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded

MST devices:
------------
/dev/mst/mt4119_pciconf0         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:31:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                   Chip revision is: 00

Inband devices:
-------------------
/dev/mst/CA_MT4119_astrostore_mlx5_0_lid-0x0001
/dev/mst/CA_MT4119_n102_mlx5_0_lid-0x0005   <--- node names
/dev/mst/CA_MT4119_n103_mlx5_0_lid-0x0007
/dev/mst/CA_MT4119_n104_mlx5_0_lid-0x0004
/dev/mst/CA_MT4119_n105_mlx5_0_lid-0x0002
/dev/mst/CA_MT4119_n106_mlx5_0_lid-0x0006
/dev/mst/CA_MT4119_n107_mlx5_0_lid-0x0008
/dev/mst/SW_MT53000_SwitchIB_lid-0x0003
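The switch is the one in-band device whose path starts with /dev/mst/SW_, and that path is what ibswinfo needs later. A minimal sketch for pulling it out of the `mst status` listing instead of copying it by hand. The function name `find_switch_devices` is made up for this example, and the SW_ prefix is assumed from the listing above rather than taken from the MFT documentation.

```shell
# Hypothetical helper: print every in-band switch device path found in
# `mst status` output supplied on stdin.
# Assumption: switch entries begin with /dev/mst/SW_, as in the listing above.
find_switch_devices() {
    awk '{
        # Check each whitespace-separated field so trailing notes are ignored.
        for (i = 1; i <= NF; i++)
            if ($i ~ /^\/dev\/mst\/SW_/) print $i
    }'
}

# Usage on a node where mst is running:
#   mst status | find_switch_devices
```

On a fabric with several switches this prints one path per line, so pick the one you want before passing it to ibswinfo.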
OK, so onward to staging ibswinfo.sh
from https://github.com/stanford-rc/ibswinfo.
Download the script and stage it in /usr/bin.
Probe…
[root@n102 ~]# ibswinfo.sh -d /dev/mst/SW_MT53000_SwitchIB_lid-0x0003
=================================================
SwitchIB Mellanox Technologies
=================================================
part number        | MSB7890-ES2F
serial number      | MT2239XZ011W
product name       | Scorpion2 IB EDR Unmanaged
revision           | AC
ports              | 36
PSID               | MT_2640110032
GUID               | 0x900a840300ecde60
firmware version   | 15.2008.2102
-------------------------------------------------
uptime (d-h:m:s)   | 33d-21:11:22
-------------------------------------------------
PSU0 status        | OK
     P/N           | MTEF-PSF-AC-I
     S/N           | MT2238XZ0MYR
     DC power      | ERROR
     fan status    | ERROR
PSU1 status        | OK
     P/N           | MTEF-PSF-AC-I
     S/N           | MT2238XZ0MZ2
     DC power      | OK
     fan status    | OK
     power (W)     | 50
-------------------------------------------------
temperature (C)    | 38
max temp (C)       | 45
-------------------------------------------------
fan status         | ERROR
fan#1 (rpm)        | 8441
fan#2 (rpm)        | 7156
fan#3 (rpm)        | 8389
fan#4 (rpm)        | 7270
fan#5 (rpm)        | 8337
fan#6 (rpm)        | 7194
fan#7 (rpm)        | 8389
fan#8 (rpm)        | 7156
-------------------------------------------------
Looks like I don't have both power units plugged in. Will have to check next time I'm in.
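Spotting the ERROR fields by eye works for one switch, but a small filter makes it easy to run from cron. A rough sketch, assuming the "label | value" layout and the dashed section separators seen in the output above. The function name `ibswinfo_errors` is invented here and is not part of ibswinfo.

```shell
# Hypothetical helper: list every field reported as ERROR in ibswinfo output
# read from stdin, prefixed with the PSU it belongs to when there is one.
# Assumption: fields look like "label | value" and a line of dashes ends a section.
ibswinfo_errors() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        NF >= 2 {
            key = trim($1); val = trim($2)
            # A label such as "PSU0 status" opens a PSU block; remember its name.
            if (match(key, /^PSU[0-9]+ /)) {
                section = substr(key, 1, RLENGTH - 1)
                key = substr(key, RLENGTH + 1)
            }
            if (val ~ /^ERROR/) print (section == "" ? key : section ": " key)
        }
        # A separator line closes the current PSU block.
        /^-+[ \t]*$/ { section = "" }
    '
}

# Usage:
#   ibswinfo.sh -d /dev/mst/SW_MT53000_SwitchIB_lid-0x0003 | ibswinfo_errors
```

For the probe above this should flag the DC power and fan status of PSU0 plus the overall fan status, which matches the unplugged power unit.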
Other useful commands…
[root@n102 ~]# ibnodes
Ca      : 0xb83fd2030063fc88 ports 1 "n107 mlx5_0"
Ca      : 0xb83fd2030063f8a4 ports 1 "n106 mlx5_0"
Ca      : 0xb83fd2030063fb5c ports 1 "n105 mlx5_0"
Ca      : 0xb83fd2030063f88c ports 1 "n104 mlx5_0"
Ca      : 0xb83fd2030063faa4 ports 1 "astrostore mlx5_0"
Ca      : 0xb83fd2030063fac8 ports 1 "n103 mlx5_0"
Ca      : 0xb83fd2030063fca0 ports 1 "n102 mlx5_0"
Switch  : 0x900a840300ecde60 ports 37 "SwitchIB Mellanox Technologies" base port 0 lid 3 lmc 0

[root@n102 ~]# ibstatus
Infiniband device 'mlx5_0' port 1 status:
        default gid:     fe80:0000:0000:0000:b83f:d203:0063:fca0
        base lid:        0x5
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            100 Gb/sec (4X EDR)
        link_layer:      InfiniBand
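The ibnodes listing gives a quick cross-check against the "Added 8 in-band devices" line from mst ib add: seven channel adapters plus one switch. A small sketch that does the counting, assuming each entry starts with "Ca" or "Switch" followed by a colon as shown above. The function name `count_fabric_nodes` is hypothetical.

```shell
# Hypothetical helper: count channel adapters and switches in ibnodes output
# read from stdin.
# Assumption: entries start with "Ca" or "Switch", then optional blanks, then ":".
count_fabric_nodes() {
    awk '
        /^Ca[ \t]*:/     { ca++ }
        /^Switch[ \t]*:/ { sw++ }
        END { printf "cas=%d switches=%d\n", ca, sw }
    '
}

# Usage:
#   ibnodes | count_fabric_nodes
```

If the two numbers do not add up to what mst ib add reported, a node has probably dropped off the fabric since the devices were added.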
Under load with full power…
ibswinfo -d /dev/mst/SW_MT53000_SwitchIB_lid-0x0003
=================================================
SwitchIB Mellanox Technologies
=================================================
part number        | MSB7890-ES2F
serial number      | MT2239XZ011W
product name       | Scorpion2 IB EDR Unmanaged
revision           | AC
ports              | 36
PSID               | MT_2640110032
GUID               | 0x900a840300ecde60
firmware version   | 15.2008.2102
-------------------------------------------------
uptime (d-h:m:s)   | 46d-21:14:46
-------------------------------------------------
PSU0 status        | OK
     P/N           | MTEF-PSF-AC-I
     S/N           | MT2238XZ0MYR
     DC power      | OK
     fan status    | OK
     power (W)     | 27     <--- 27+32=59 units rated typical 122, max 162
PSU1 status        | OK
     P/N           | MTEF-PSF-AC-I
     S/N           | MT2238XZ0MZ2
     DC power      | OK
     fan status    | OK
     power (W)     | 32
-------------------------------------------------
temperature (C)    | 39     <--- one degree higher
max temp (C)       | 45
-------------------------------------------------
fan status         | OK     <--- speeds about the same
fan#1 (rpm)        | 8337
fan#2 (rpm)        | 7194
fan#3 (rpm)        | 8287
fan#4 (rpm)        | 7045
fan#5 (rpm)        | 8389
fan#6 (rpm)        | 7194
fan#7 (rpm)        | 8441
fan#8 (rpm)        | 7156
-------------------------------------------------
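The 27+32=59 sum noted above can be scripted so the total draw is easy to log over time. A minimal sketch, assuming each PSU reports a "power (W)" field in the "label | value" layout shown above. The function name `total_power_watts` is made up for this example.

```shell
# Hypothetical helper: add up every "power (W)" field in ibswinfo output
# read from stdin and print the total in watts.
# Assumption: the value follows the "|" and starts with a plain number.
total_power_watts() {
    awk -F'|' '
        # "DC power" lines do not match because they lack the "(W)" suffix.
        $1 ~ /power \(W\)/ { total += $2 + 0 }
        END { print total + 0 }
    '
}

# Usage:
#   ibswinfo -d /dev/mst/SW_MT53000_SwitchIB_lid-0x0003 | total_power_watts
```

Adding zero forces awk to read the leading number of the field, so a trailing hand-written note on the same line does not disturb the sum.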