\\
**[[cluster:0|Back]]**

==== Infiniband Monitoring ====

The NVIDIA Firmware Tools (MFT) is a toolset for generating standard or customized NVIDIA firmware images and for querying firmware information. It is required by ''ibswinfo'', which can monitor unmanaged Infiniband switches. Our new Infiniband switch is an **SB7890** EDR.

You will also need

  * infiniband-diags
  * awk, sed, coreutils

But first, let's download MFT for Linux x86_64 as an rpm package and install it. Run ''install.sh'' with no arguments.

  * https://network.nvidia.com/products/adapter-software/firmware-tools/

This is hairier than it looks. MFT writes to ''/dev'' as root and builds kernel modules, so make sure you have a backup. It also pulls down a suite of packages. So rather than do this on a storage server or head node, I will install it on a compute node connected to the switch.

  yumdownloader --destdir=`pwd` \
  annobin dwz efi-srpm-macros elfutils gc gcc-plugin-annobin \
  gdb-headless ghc-srpm-macros go-srpm-macros guile libatomic_ops \
  libbabeltrace libipt ocaml-srpm-macros openblas-srpm-macros \
  patch perl-srpm-macros python-rpm-macros python-srpm-macros \
  python3-rpm-macros qt5-srpm-macros redhat-rpm-config \
  rust-srpm-macros zlib-devel zstd elfutils-libelf-devel \
  python3-rpm-generators-5-7.el8.noarch.rpm
  rm -f *i686.rpm
  # copy to n102 and install as root

  ==========================================================================================
   Package                  Architecture  Version              Repository       Size
  ==========================================================================================
  Installing:
   kernel-devel             x86_64        4.18.0-425.3.1.el8   baseos           22 M
   make                     x86_64        1:4.2.1-11.el8       baseos          497 k
   rpm-build                x86_64        4.14.3-24.el8_7      appstream       173 k
  Installing dependencies:
   annobin                  x86_64        10.67-3.el8          appstream       954 k
   dwz                      x86_64        0.12-10.el8          appstream       108 k
   efi-srpm-macros          noarch        3-3.el8              appstream        21 k
   elfutils                 x86_64        0.187-4.el8          baseos          542 k
   gc                       x86_64        7.6.4-3.el8          appstream       108 k
   gcc-plugin-annobin       x86_64        8.5.0-15.el8         appstream        34 k
   gdb-headless             x86_64        8.2-19.el8           appstream       3.7 M
   ghc-srpm-macros          noarch        1.4.2-7.el8          appstream       8.3 k
   go-srpm-macros           noarch        2-17.el8             appstream        12 k
   guile                    x86_64        5:2.0.14-7.el8       appstream       3.5 M
   libatomic_ops            x86_64        7.6.2-3.el8          appstream        37 k
   libbabeltrace            x86_64        1.5.4-4.el8          baseos          199 k
   libipt                   x86_64        1.6.1-8.el8          appstream        49 k
   ocaml-srpm-macros        noarch        5-4.el8              appstream       8.3 k
   openblas-srpm-macros     noarch        2-2.el8              appstream       6.9 k
   patch                    x86_64        2.7.6-11.el8         baseos          137 k
   perl-srpm-macros         noarch        1-25.el8             appstream       9.7 k
   python-rpm-macros        noarch        3-43.el8             appstream        15 k
   python-srpm-macros       noarch        3-43.el8             appstream        14 k
   python3-rpm-macros       noarch        3-43.el8             appstream        14 k
   python3-rpm-generators   noarch        5-7.el8              appstream        14 k
   qt5-srpm-macros          noarch        5.15.3-1.el8         appstream       9.5 k
   redhat-rpm-config        noarch        130-1.el8            appstream        89 k
   rust-srpm-macros         noarch        5-2.el8              appstream       8.2 k
   zlib-devel               x86_64        1.2.11-20.el8        baseos           57 k
   zstd                     x86_64        1.4.4-1.el8          appstream       392 k
  Installing weak dependencies:
   elfutils-libelf-devel    x86_64        0.187-4.el8          baseos           60 k

  [root@n102 mft-4.23.0-104-x86_64-rpm]# ./install.sh
  -I- Removing any old MFT file if exists...
  -I- Building the MFT kernel binary RPM...
  -I- Installing the MFT RPMs...
  Verifying...                          ################################# [100%]
  Preparing...                          ################################# [100%]
  Updating / installing...
     1:kernel-mft-4.23.0-4.18.0_425.3.1.################################# [100%]
  Verifying...                          ################################# [100%]
  Preparing...                          ################################# [100%]
  Updating / installing...
     1:mft-4.23.0-104                   ################################# [100%]
  -I- In order to start mst, please run "mst start".

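The install log shows ''install.sh'' building a ''kernel-mft'' RPM for kernel 4.18.0_425.3.1, so the installed ''kernel-devel'' presumably has to match the running kernel or the build step will fail. Below is a minimal sketch of a pre-flight check, assuming an RPM-based node like n102. The function name ''kernel_devel_matches'' is made up for illustration and is not part of MFT.

```shell
#!/bin/sh
# Hypothetical helper, not part of MFT.
# Usage: kernel_devel_matches RUNNING_RELEASE [INSTALLED_RELEASE...]
# Returns 0 when one of the installed kernel-devel releases equals the
# running kernel release, 1 otherwise (including when none are given).
kernel_devel_matches() {
    running=$1
    shift
    for installed in "$@"; do
        if [ "$installed" = "$running" ]; then
            return 0
        fi
    done
    return 1
}

# Possible use on a real node (assumes rpm and kernel-devel are present):
#   if kernel_devel_matches "$(uname -r)" \
#        $(rpm -q --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' kernel-devel); then
#       ./install.sh
#   else
#       echo "kernel-devel does not match $(uname -r)" >&2
#   fi
```

The comparison is kept separate from the ''uname'' and ''rpm'' calls so it can be tried with plain strings before touching a node.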
  [root@n102 ~]# mst start
  Starting MST (Mellanox Software Tools) driver set
  Loading MST PCI module - Success
  Loading MST PCI configuration module - Success
  Create devices
  Unloading MST PCI module (unused) - Success

  [root@n102 ~]# mst ib add
  -I- Discovering the fabric - Running: ibnetdiscover
  -I- Added 8 in-band devices

  [root@n102 ~]# mst status
  MST modules:
  ------------
      MST PCI module is not loaded
      MST PCI configuration module loaded
  MST devices:
  ------------
  /dev/mst/mt4119_pciconf0         - PCI configuration cycles access.
                                     domain:bus:dev.fn=0000:31:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                     Chip revision is: 00
  Inband devices:
  -------------------
  /dev/mst/CA_MT4119_astrostore_mlx5_0_lid-0x0001
  /dev/mst/CA_MT4119_n102_mlx5_0_lid-0x0005   <--- node names
  /dev/mst/CA_MT4119_n103_mlx5_0_lid-0x0007
  /dev/mst/CA_MT4119_n104_mlx5_0_lid-0x0004
  /dev/mst/CA_MT4119_n105_mlx5_0_lid-0x0002
  /dev/mst/CA_MT4119_n106_mlx5_0_lid-0x0006
  /dev/mst/CA_MT4119_n107_mlx5_0_lid-0x0008
  /dev/mst/SW_MT53000_SwitchIB_lid-0x0003

Ok, so onward to stage ''ibswinfo.sh'' from https://github.com/stanford-rc/ibswinfo

Download the script and stage in ''/usr/bin''

Probe...

  [root@n102 ~]# ibswinfo.sh -d /dev/mst/SW_MT53000_SwitchIB_lid-0x0003
  =================================================
  SwitchIB Mellanox Technologies
  =================================================
  part number        | MSB7890-ES2F
  serial number      | MT2239XZ011W
  product name       | Scorpion2 IB EDR Unmanaged
  revision           | AC
  ports              | 36
  PSID               | MT_2640110032
  GUID               | 0x900a840300ecde60
  firmware version   | 15.2008.2102
  -------------------------------------------------
  uptime (d-h:m:s)   | 33d-21:11:22
  -------------------------------------------------
  PSU0 status        | OK
       P/N           | MTEF-PSF-AC-I
       S/N           | MT2238XZ0MYR
       DC power      | ERROR
       fan status    | ERROR
  PSU1 status        | OK
       P/N           | MTEF-PSF-AC-I
       S/N           | MT2238XZ0MZ2
       DC power      | OK
       fan status    | OK
       power (W)     | 50
  -------------------------------------------------
  temperature (C)    | 38
  max temp (C)       | 45
  -------------------------------------------------
  fan status         | ERROR
  fan#1 (rpm)        | 8441
  fan#2 (rpm)        | 7156
  fan#3 (rpm)        | 8389
  fan#4 (rpm)        | 7270
  fan#5 (rpm)        | 8337
  fan#6 (rpm)        | 7194
  fan#7 (rpm)        | 8389
  fan#8 (rpm)        | 7156
  -------------------------------------------------

Looks like I don't have both power units plugged in. Will have to check next time I'm in.

Other useful commands...

  [root@n102 ~]# ibnodes
  Ca      : 0xb83fd2030063fc88 ports 1 "n107 mlx5_0"
  Ca      : 0xb83fd2030063f8a4 ports 1 "n106 mlx5_0"
  Ca      : 0xb83fd2030063fb5c ports 1 "n105 mlx5_0"
  Ca      : 0xb83fd2030063f88c ports 1 "n104 mlx5_0"
  Ca      : 0xb83fd2030063faa4 ports 1 "astrostore mlx5_0"
  Ca      : 0xb83fd2030063fac8 ports 1 "n103 mlx5_0"
  Ca      : 0xb83fd2030063fca0 ports 1 "n102 mlx5_0"
  Switch  : 0x900a840300ecde60 ports 37 "SwitchIB Mellanox Technologies" base port 0 lid 3 lmc 0

  [root@n102 ~]# ibstatus
  Infiniband device 'mlx5_0' port 1 status:
          default gid:     fe80:0000:0000:0000:b83f:d203:0063:fca0
          base lid:        0x5
          sm lid:          0x1
          state:           4: ACTIVE
          phys state:      5: LinkUp
          rate:            100 Gb/sec (4X EDR)
          link_layer:      InfiniBand

Under load with full power...

  ibswinfo -d /dev/mst/SW_MT53000_SwitchIB_lid-0x0003
  =================================================
  SwitchIB Mellanox Technologies
  =================================================
  part number        | MSB7890-ES2F
  serial number      | MT2239XZ011W
  product name       | Scorpion2 IB EDR Unmanaged
  revision           | AC
  ports              | 36
  PSID               | MT_2640110032
  GUID               | 0x900a840300ecde60
  firmware version   | 15.2008.2102
  -------------------------------------------------
  uptime (d-h:m:s)   | 46d-21:14:46
  -------------------------------------------------
  PSU0 status        | OK
       P/N           | MTEF-PSF-AC-I
       S/N           | MT2238XZ0MYR
       DC power      | OK
       fan status    | OK
       power (W)     | 27   <--- 27+32=59, units rated typical 122, max 162
  PSU1 status        | OK
       P/N           | MTEF-PSF-AC-I
       S/N           | MT2238XZ0MZ2
       DC power      | OK
       fan status    | OK
       power (W)     | 32
  -------------------------------------------------
  temperature (C)    | 39   <--- one degree higher
  max temp (C)       | 45
  -------------------------------------------------
  fan status         | OK   <--- speeds about the same
  fan#1 (rpm)        | 8337
  fan#2 (rpm)        | 7194
  fan#3 (rpm)        | 8287
  fan#4 (rpm)        | 7045
  fan#5 (rpm)        | 8389
  fan#6 (rpm)        | 7194
  fan#7 (rpm)        | 8441
  fan#8 (rpm)        | 7156
  -------------------------------------------------
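Comparing two reports by eye works for one switch, but the same check is easy to script. Below is a minimal sketch using awk (already on the requirements list), assuming the ''key | value'' layout seen in the ''ibswinfo'' reports on this page. The function name ''summarize_ibswinfo'' is hypothetical and not part of ''ibswinfo''.

```shell
#!/bin/sh
# Hypothetical helper, not part of ibswinfo.
# Reads an ibswinfo-style report on stdin, prints one line per field whose
# value is ERROR, then the sum of all "power (W)" fields.
summarize_ibswinfo() {
    awk -F'|' '
        # Separator rows end the current PSU block.
        /^[-=]+[ \t]*$/ { psu = ""; next }
        NF >= 2 {
            key = $1
            val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            sub(/[ \t]*<---.*$/, "", val)      # drop hand-written "<---" notes
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            label = key
            if (key ~ /^PSU[0-9]+ status$/) {
                # Remember which PSU the following indented fields belong to.
                psu = substr(key, 1, index(key, " ") - 1)
            } else if (psu != "") {
                label = psu " " key
            }
            if (val == "ERROR") print "ERROR: " label
            if (key == "power (W)") total += val
        }
        END { print "total power (W): " (total + 0) }
    '
}
```

Piping a probe into it, for example ''ibswinfo.sh -d /dev/mst/SW_MT53000_SwitchIB_lid-0x0003 | summarize_ibswinfo'', should flag the PSU0 and fan errors from the first report and give a total of 59 W for the second, matching the hand arithmetic in the annotation.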