Infiniband Monitoring

NVIDIA Firmware Tools (MFT) is a toolset for generating standard or customized NVIDIA firmware images and for querying firmware information. It is required by ibswinfo, which can monitor unmanaged Infiniband switches. Our new Infiniband switch is an SB7890 EDR. You will also need

  • infiniband-diags
  • awk, sed, coreutils
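
A quick way to confirm the prerequisites are in place before going further. This is only a sketch: missing_tools is a helper name I made up, and the command names in the usage comment are my assumption about which binaries matter (mst comes with MFT, ibnetdiscover with infiniband-diags).

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
# missing_tools is a hypothetical helper, not part of MFT or ibswinfo.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage on the node that will run ibswinfo (tool list is an assumption):
#   missing_tools mst ibnetdiscover awk sed
```

No output means everything in the list was found.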

But first let's download MFT for Linux x86 as an rpm package and install it. Run install.sh with no arguments.

This is hairier than it looks. MFT writes to /dev as root and builds kernel modules, so make sure you have a backup. It also pulls in a suite of packages. So rather than doing this on a storage server or head node, I will install it on a compute node connected to the switch.

yumdownloader --destdir=`pwd` \
annobin dwz efi-srpm-macros elfutils gc gcc-plugin-annobin \
gdb-headless ghc-srpm-macros go-srpm-macros guile libatomic_ops \
libbabeltrace libipt ocaml-srpm-macros openblas-srpm-macros \
patch perl-srpm-macros python-rpm-macros python-srpm-macros \
python3-rpm-macros qt5-srpm-macros redhat-rpm-config \
rust-srpm-macros zlib-devel zstd elfutils-libelf-devel \
python3-rpm-generators

rm -f *i686.rpm

# copy to n102 and install as root

======================================================================================================
 Package                        Architecture    Version                      Repository          Size
======================================================================================================
Installing:
 kernel-devel                   x86_64          4.18.0-425.3.1.el8           baseos              22 M
 make                           x86_64          1:4.2.1-11.el8               baseos             497 k
 rpm-build                      x86_64          4.14.3-24.el8_7              appstream          173 k
Installing dependencies:
 annobin                        x86_64          10.67-3.el8                  appstream          954 k
 dwz                            x86_64          0.12-10.el8                  appstream          108 k
 efi-srpm-macros                noarch          3-3.el8                      appstream           21 k
 elfutils                       x86_64          0.187-4.el8                  baseos             542 k
 gc                             x86_64          7.6.4-3.el8                  appstream          108 k
 gcc-plugin-annobin             x86_64          8.5.0-15.el8                 appstream           34 k
 gdb-headless                   x86_64          8.2-19.el8                   appstream          3.7 M
 ghc-srpm-macros                noarch          1.4.2-7.el8                  appstream          8.3 k
 go-srpm-macros                 noarch          2-17.el8                     appstream           12 k
 guile                          x86_64          5:2.0.14-7.el8               appstream          3.5 M
 libatomic_ops                  x86_64          7.6.2-3.el8                  appstream           37 k
 libbabeltrace                  x86_64          1.5.4-4.el8                  baseos             199 k
 libipt                         x86_64          1.6.1-8.el8                  appstream           49 k
 ocaml-srpm-macros              noarch          5-4.el8                      appstream          8.3 k
 openblas-srpm-macros           noarch          2-2.el8                      appstream          6.9 k
 patch                          x86_64          2.7.6-11.el8                 baseos             137 k
 perl-srpm-macros               noarch          1-25.el8                     appstream          9.7 k
 python-rpm-macros              noarch          3-43.el8                     appstream           15 k
 python-srpm-macros             noarch          3-43.el8                     appstream           14 k
 python3-rpm-macros             noarch          3-43.el8                     appstream           14 k
 python3-rpm-generators         noarch          5-7.el8                      appstream           14 k
 qt5-srpm-macros                noarch          5.15.3-1.el8                 appstream          9.5 k
 redhat-rpm-config              noarch          130-1.el8                    appstream           89 k
 rust-srpm-macros               noarch          5-2.el8                      appstream          8.2 k
 zlib-devel                     x86_64          1.2.11-20.el8                baseos              57 k
 zstd                           x86_64          1.4.4-1.el8                  appstream          392 k
Installing weak dependencies:
 elfutils-libelf-devel          x86_64          0.187-4.el8                  baseos              60 k

[root@n102 mft-4.23.0-104-x86_64-rpm]# ./install.sh 
-I- Removing any old MFT file if exists...
-I- Building the MFT kernel binary RPM...
-I- Installing the MFT RPMs...
Verifying...                          ################################# [100%]
Preparing...                          ################################# [100%]
Updating / installing...
   1:kernel-mft-4.23.0-4.18.0_425.3.1.################################# [100%]
Verifying...                          ################################# [100%]
Preparing...                          ################################# [100%]
Updating / installing...
   1:mft-4.23.0-104                   ################################# [100%]
-I- In order to start mst, please run "mst start".

[root@n102 ~]# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Loading MST PCI configuration module - Success
Create devices
Unloading MST PCI module (unused) - Success

[root@n102 ~]# mst ib add
-I- Discovering the fabric - Running: ibnetdiscover
-I- Added 8 in-band devices

[root@n102 ~]# mst status
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded

MST devices:
------------
/dev/mst/mt4119_pciconf0         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:31:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                   Chip revision is: 00

Inband devices:
-------------------
/dev/mst/CA_MT4119_astrostore_mlx5_0_lid-0x0001
/dev/mst/CA_MT4119_n102_mlx5_0_lid-0x0005              <--- node names
/dev/mst/CA_MT4119_n103_mlx5_0_lid-0x0007
/dev/mst/CA_MT4119_n104_mlx5_0_lid-0x0004
/dev/mst/CA_MT4119_n105_mlx5_0_lid-0x0002
/dev/mst/CA_MT4119_n106_mlx5_0_lid-0x0006
/dev/mst/CA_MT4119_n107_mlx5_0_lid-0x0008
/dev/mst/SW_MT53000_SwitchIB_lid-0x0003
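
The switch is the entry whose name starts with SW_. To avoid copying that path by hand, something like the following could pull it out of the listing. It is a sketch that assumes in-band switch devices always show up under /dev/mst/SW_ as they do here; switch_devices is a made-up helper name.

```shell
#!/bin/sh
# Read `mst status` output on stdin and print in-band switch device paths.
# Assumes switch entries start with /dev/mst/SW_, as in the listing above.
# switch_devices is a hypothetical helper, not an MFT command.
switch_devices() {
    awk '$1 ~ /^\/dev\/mst\/SW_/ { print $1 }'
}

# Usage:
#   mst status | switch_devices
```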

Ok, so onward to stage ibswinfo.sh from https://github.com/stanford-rc/ibswinfo

Download the script and stage in /usr/bin

Probe…

[root@n102 ~]# ibswinfo.sh -d /dev/mst/SW_MT53000_SwitchIB_lid-0x0003 
=================================================
SwitchIB Mellanox Technologies
=================================================
part number        | MSB7890-ES2F
serial number      | MT2239XZ011W
product name       | Scorpion2 IB EDR Unmanaged
revision           | AC
ports              | 36
PSID               | MT_2640110032
GUID               | 0x900a840300ecde60
firmware version   | 15.2008.2102
-------------------------------------------------
uptime (d-h:m:s)   | 33d-21:11:22
-------------------------------------------------
PSU0 status        | OK
     P/N           | MTEF-PSF-AC-I
     S/N           | MT2238XZ0MYR
     DC power      | ERROR
     fan status    | ERROR
PSU1 status        | OK
     P/N           | MTEF-PSF-AC-I
     S/N           | MT2238XZ0MZ2
     DC power      | OK
     fan status    | OK
     power (W)     | 50
-------------------------------------------------
temperature (C)    | 38
max temp (C)       | 45
-------------------------------------------------
fan status         | ERROR
fan#1 (rpm)        | 8441
fan#2 (rpm)        | 7156
fan#3 (rpm)        | 8389
fan#4 (rpm)        | 7270
fan#5 (rpm)        | 8337
fan#6 (rpm)        | 7194
fan#7 (rpm)        | 8389
fan#8 (rpm)        | 7156
-------------------------------------------------

Looks like I don't have both power units plugged in. Will have to check next time I'm in.
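
For routine checks it may be easier to list only the fields reporting ERROR. A minimal sketch, assuming the "name | value" layout shown above stays stable and that indented fields belong to the PSU line before them; ibswinfo_errors is a made-up helper name.

```shell
#!/bin/sh
# Read ibswinfo output on stdin; print each field whose value is ERROR,
# prefixed with the PSU it belongs to when the field is indented under one.
# ibswinfo_errors is a hypothetical helper, not part of ibswinfo.
ibswinfo_errors() {
    awk -F'|' '
        NF < 2 { next }                       # skip banner and separator lines
        {
            name = $1; value = $2
            indented = (name ~ /^[ \t]/)      # PSU sub-fields are indented
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (name ~ /^PSU[0-9]+ status$/) {
                psu = name
                sub(/ status$/, "", psu)      # remember the current PSU
            }
            if (value == "ERROR") {
                prefix = indented ? psu " " : ""
                print prefix name
            }
        }'
}

# Usage:
#   ibswinfo.sh -d /dev/mst/SW_MT53000_SwitchIB_lid-0x0003 | ibswinfo_errors
```

Fed the probe output above, this should list PSU0's DC power and fan status plus the overall fan status.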

Other useful commands…

[root@n102 ~]# ibnodes
Ca      : 0xb83fd2030063fc88 ports 1 "n107 mlx5_0"
Ca      : 0xb83fd2030063f8a4 ports 1 "n106 mlx5_0"
Ca      : 0xb83fd2030063fb5c ports 1 "n105 mlx5_0"
Ca      : 0xb83fd2030063f88c ports 1 "n104 mlx5_0"
Ca      : 0xb83fd2030063faa4 ports 1 "astrostore mlx5_0"
Ca      : 0xb83fd2030063fac8 ports 1 "n103 mlx5_0"
Ca      : 0xb83fd2030063fca0 ports 1 "n102 mlx5_0"
Switch  : 0x900a840300ecde60 ports 37 "SwitchIB Mellanox Technologies" base port 0 lid 3 lmc 0
[root@n102 ~]# ibstatus
Infiniband device 'mlx5_0' port 1 status:
        default gid:     fe80:0000:0000:0000:b83f:d203:0063:fca0
        base lid:        0x5
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            100 Gb/sec (4X EDR)
        link_layer:      InfiniBand

Under load with full power…

ibswinfo -d /dev/mst/SW_MT53000_SwitchIB_lid-0x0003
=================================================
SwitchIB Mellanox Technologies
=================================================
part number        | MSB7890-ES2F
serial number      | MT2239XZ011W
product name       | Scorpion2 IB EDR Unmanaged
revision           | AC
ports              | 36
PSID               | MT_2640110032
GUID               | 0x900a840300ecde60
firmware version   | 15.2008.2102
-------------------------------------------------
uptime (d-h:m:s)   | 46d-21:14:46
-------------------------------------------------
PSU0 status        | OK
     P/N           | MTEF-PSF-AC-I
     S/N           | MT2238XZ0MYR
     DC power      | OK
     fan status    | OK
     power (W)     | 27    <--- 27+32=59 W; units rated typical 122 W, max 162 W
PSU1 status        | OK
     P/N           | MTEF-PSF-AC-I
     S/N           | MT2238XZ0MZ2
     DC power      | OK
     fan status    | OK
     power (W)     | 32
-------------------------------------------------
temperature (C)    | 39    <--- one degree higher
max temp (C)       | 45
-------------------------------------------------
fan status         | OK    <--- speeds about the same
fan#1 (rpm)        | 8337
fan#2 (rpm)        | 7194
fan#3 (rpm)        | 8287
fan#4 (rpm)        | 7045
fan#5 (rpm)        | 8389
fan#6 (rpm)        | 7194
fan#7 (rpm)        | 8441
fan#8 (rpm)        | 7156
-------------------------------------------------
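
The 27+32=59 sum can be scripted instead of done by eye. A small sketch, assuming every PSU reports a "power (W)" line in the format above; total_power_watts is a made-up helper name.

```shell
#!/bin/sh
# Read ibswinfo output on stdin and print the sum of all "power (W)" fields.
# total_power_watts is a hypothetical helper, not part of ibswinfo.
total_power_watts() {
    awk -F'|' '
        $1 ~ /power \(W\)/ { total += $2 + 0 }   # "+ 0" forces a numeric read
        END { print total + 0 }'
}

# Usage:
#   ibswinfo -d /dev/mst/SW_MT53000_SwitchIB_lid-0x0003 | total_power_watts
```

A PSU that is not delivering DC power prints no "power (W)" line, so it simply adds nothing to the total.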


cluster/221.txt · Last modified: 2023/03/14 09:59 by hmeij07