
Slurm entangles

So, I vaguely remember having trouble with this ASUS hardware and Warewulf 3.6 when redoing my K20 GPU nodes. Now I have deployed a production cluster using OpenHPC 2.4, Rocky 8.5 and Warewulf 3.9. Same deal. I do not know what is going on, but I am documenting it here.

That's too bad, as I was hoping to have a single operating system cluster. But now I will have to think about what to do with our CentOS 7 hardware, which is running the old scheduler. The hope was to migrate everything to the Slurm scheduler.

ASUS

First we reset the BIOS and make sure PXE boot is enabled, in legacy boot mode.

  1. Save & Exit > Restore Defaults > Yes > Save & Reset, then on the next boot
  2. Advanced > CSM Configuration > Enable > Boot Option Filter = Legacy
  3. Advanced > CSM Configuration > Network = Legacy
  4. Advanced > Network Stack > Enable
  5. Within that tab enable PXE4 and PXE6 support
  6. Boot > boot order: network first, then hard drive
  7. Save & Exit > Yes & Reset

Next we create the warewulf object and boot (see deploy script, at bottom).
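
A rough sketch of what that object creation typically looks like with Warewulf 3, following the OpenHPC recipe (this is not the actual deploy script; the vnfs/bootstrap names are examples, the MAC and IP are taken from the logs below):

wwsh -y node new n90 --ipaddr=192.168.102.100 \
     --hwaddr=04:d9:f5:bc:6e:c2 -D eth0
wwsh -y provision set n90 --vnfs=rocky8.5 --bootstrap=`uname -r` \
     --files=dynamic_hosts,passwd,group,shadow,munge.key
systemctl restart dhcpd
wwsh pxe update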

When this ASUS hardware boots, it sends over the correct MAC address. We observe…

# in /var/log/messages

Jun 10 09:13:41 cottontail2 dhcpd[380262]: DHCPDISCOVER from 04:d9:f5:bc:6e:c2 via eth0
Jun 10 09:13:41 cottontail2 dhcpd[380262]: DHCPOFFER on 192.168.102.100 to 04:d9:f5:bc:6e:c2 via eth0

# also in /var/log/messages (the tftp transfer completes, but no GET ever shows up in /etc/httpd/logs/access_log)

Jun  10 09:13:57 cottontail2 in.tftpd[388239]: Client ::ffff:192.168.102.100 finished /warewulf/ipxe/bin-i386-pcbios/undionly.kpxe

That's it. Everything goes quiet. On the node's console during PXE boot I observe iPXE net0 being configured with the correct MAC address, then it times out with the error “no more network devices available”, or some such. Then the node continues to boot off the hard disk and CentOS 6 shows up.

And when testing connectivity between the node and the SMS all is well … but the GET never happens. The ipxe config file is there, the correct NIC is responding. Weird. ASUS splash screen: “In search of the incredible”. Indeed.

[root@n90 tmp]# telnet cottontail2 80
Trying 192.168.102.250...
Connected to cottontail2.
Escape character is '^]'.
GET /WW/file?hwaddr=04:d9:f5:bc:6e:c2&timestamp=0

# all files are retrieved
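
The same request can be repeated with curl from the node (a sketch; the output path is arbitrary):

curl -v "http://cottontail2/WW/file?hwaddr=04:d9:f5:bc:6e:c2&timestamp=0" -o /tmp/ww-provision-files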

Slurm #1

First thought was to install the OHPC v1.3 CentOS 7 slurmd client on the node, then join that to the OHPC v2.4 slurmctld. To do that, first yum install the ohpc-release from this location.

Next do a 'yum install generic-package-name' of these packages to install the slurmd client of ohpc 1.3 for CentOS 7 (a sketch of the install command follows the list).

-rw-r--r-- 1 root root    35264 Feb 23  2017 munge-devel-ohpc-0.5.12-21.1.x86_64.rpm
-rw-r--r-- 1 root root    51432 Jun 16 14:40 munge-libs-ohpc-0.5.12-21.1.x86_64.rpm
-rw-r--r-- 1 root root   114060 Jun 16 14:40 munge-ohpc-0.5.12-21.1.x86_64.rpm
-rw-r--r-- 1 root root     3468 Jun 16 14:44 ohpc-filesystem-1.3-26.1.ohpc.1.3.6.noarch.rpm
-rw-r--r-- 1 root root     2396 Jun 16 14:40 ohpc-slurm-client-1.3.8-3.1.ohpc.1.3.8.x86_64.rpm
-rw-r--r-- 1 root root  4434196 Jun 16 14:46 pmix-ohpc-2.2.2-9.1.ohpc.1.3.7.x86_64.rpm
-rw-r--r-- 1 root root    17324 Jun 16 14:40 slurm-contribs-ohpc-18.08.8-4.1.ohpc.1.3.8.1.x86_64.rpm
-rw-r--r-- 1 root root   198028 Jun 16 14:40 slurm-example-configs-ohpc-18.08.8-4.1.ohpc.1.3.8.1.x86_64.rpm
-rw-r--r-- 1 root root 13375940 Jun 16 14:40 slurm-ohpc-18.08.8-4.1.ohpc.1.3.8.1.x86_64.rpm
-rw-r--r-- 1 root root   148980 Jun 16 14:40 slurm-pam_slurm-ohpc-18.08.8-4.1.ohpc.1.3.8.1.x86_64.rpm
-rw-r--r-- 1 root root   796280 Jun 16 14:44 slurm-perlapi-ohpc-18.08.8-4.1.ohpc.1.3.8.1.x86_64.rpm
-rw-r--r-- 1 root root   654104 Jun 16 14:40 slurm-slurmd-ohpc-18.08.8-4.1.ohpc.1.3.8.1.x86_64.rpm
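
A minimal sketch of that install step, assuming the rpms listed above were downloaded into the current directory:

yum install ./munge-ohpc-*.rpm ./munge-libs-ohpc-*.rpm ./munge-devel-ohpc-*.rpm \
    ./ohpc-filesystem-*.rpm ./ohpc-slurm-client-*.rpm ./pmix-ohpc-*.rpm \
    ./slurm-ohpc-*.rpm ./slurm-slurmd-ohpc-*.rpm ./slurm-pam_slurm-ohpc-*.rpm \
    ./slurm-perlapi-ohpc-*.rpm ./slurm-contribs-ohpc-*.rpm ./slurm-example-configs-ohpc-*.rpm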

Make sure munge/unmunge work between 1.3/2.4 and that the date is in sync (else you get error (16)), then start up slurmd with the 2.4 config files in place. This works, but the 1.3 slurmd client fails to register. This appears to be because the slurm versions are too far apart: 18.08 (2018) vs 20.11 (2020). Hmm, why is ohpc v2.4 running such an old slurm version?

[root@cottontail2 ~]#  munge -n -t 10 | ssh n90 unmunge
STATUS:           Success (0)
ENCODE_HOST:      cottontail2 (192.168.102.250)
ENCODE_TIME:      2022-06-17 09:35:08 -0400 (1655472908)
DECODE_TIME:      2022-06-17 09:35:07 -0400 (1655472907)
TTL:              10
CIPHER:           aes128 (4)
MAC:              sha256 (5)
ZIP:              none (0)
UID:              root (0)
GID:              root (0)
LENGTH:           0
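
To watch the registration attempt fail, it helps to run slurmd in the foreground with verbose logging (a sketch, not from the original notes; assumes the 2.4 slurm.conf and munge key are already in place on the node):

# on the node
slurmd -D -vvvv
# on the SMS, check what slurmctld thinks of it (log path per SlurmctldLogFile in slurm.conf)
grep -i n90 /var/log/slurmctld.log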

Too bad. Ok, we'll keep the munge packages and remove all other ohpc v1.3 packages.

Slurm #2

I downloaded the Slurm source for the closest version just above the one shipped with ohpc v2.4. Next, compile slurm 20.11.9 and see if it is accepted by the ohpc v2.4 slurmctld running slurm 20.11.8.

export PATH=/share/apps/CENTOS7/openmpi/4.0.4/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS7/openmpi/4.0.4/lib:$LD_LIBRARY_PATH
[root@n90 ~]# which gcc mpicc
/usr/bin/gcc
/share/apps/CENTOS7/openmpi/4.0.4/bin/mpicc

./configure --prefix=/usr/local/slurm-20.11.9 \
--sysconfdir=/usr/local/slurm-20.11.9/etc \
--with-nvml=/usr/local/cuda
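
The notes skip the build/install step; a minimal sketch, assuming the configure above succeeded:

make -j $(nproc)
make install
# place the cluster's slurm.conf (and gres.conf if used) under the sysconfdir
mkdir -p /usr/local/slurm-20.11.9/etc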


[root@n90 slurm-20.11.9]# find /usr/local/slurm-20.11.9 -name auth_munge.so
/usr/local/slurm-20.11.9/lib/slurm/auth_munge.so

# make the generic /usr/local/slurm -> /usr/local/slurm-20.11.9 link
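# for example (a sketch):
ln -sfn /usr/local/slurm-20.11.9 /usr/local/slurm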

YES! it does register, hurray.

Finish with startup at boot in /etc/rc.local and export the environment variables in /etc/bashrc.
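
A minimal sketch of those two files, assuming the /usr/local/slurm symlink above (paths are examples, not the exact production files):

# /etc/rc.local  (on CentOS 7 it must be executable: chmod +x /etc/rc.d/rc.local)
/usr/local/slurm/sbin/slurmd

# /etc/bashrc
export PATH=/usr/local/slurm/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/slurm/lib:$LD_LIBRARY_PATH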

Do NOT mount /opt/intel and /opt/ohpc/pub from SMS, that's all Rocky8.5 stuff.

Slurm #3

Just a mental note. There is a warning on the Slurm web page regarding the older versions archive page:

“Due to a security vulnerability (CVE-2022-29500), all versions of Slurm prior to 21.08.8 or 20.11.9 are no longer available for download” … so why is openhpc v2.4 running such an old version?

Once we're fully deployed I may go to the latest Slurm version, run it on different ports with maybe a newer munge version (although that should not matter; why does this scheduler even need munge?) and run Slurm outside of openhpc. A standalone version.

