I vaguely remember having trouble with this ASUS hardware and Warewulf 3.6 when we redid our K20 GPU nodes. Now I have deployed a production cluster using OpenHPC 2.4, Rocky 8.5 and Warewulf 3.9. Same deal. I do not know what is going on, but I am documenting it here.
That's too bad, as I was hoping to have a single-operating-system cluster. Now I will have to think about what to do with our CentOS 7 hardware, which is running the old scheduler. The hope was to migrate everything to the Slurm scheduler.
First we reset the BIOS and make sure PXE boot is enabled, in legacy boot mode.
Next we create the Warewulf node object and boot (see the deploy script at the bottom).
When this ASUS hardware boots, it sends over the correct MAC address. We observe:
# in /var/log/messages
Jun 10 09:13:41 cottontail2 dhcpd[380262]: DHCPDISCOVER from 04:d9:f5:bc:6e:c2 via eth0
Jun 10 09:13:41 cottontail2 dhcpd[380262]: DHCPOFFER on 192.168.102.100 to 04:d9:f5:bc:6e:c2 via eth0

# in /etc/httpd/logs/access_log
Jun 10 09:13:57 cottontail2 in.tftpd[388239]: Client ::ffff:192.168.102.100 \
  finished /warewulf/ipxe/bin-i386-pcbios/undionly.kpxe
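To watch a provisioning attempt live from the SMS, tailing the two logs quoted above is enough (a trivial sketch):

# on cottontail2, while the node PXE boots
tail -f /var/log/messages | grep -Ei 'dhcpd|tftpd'
tail -f /etc/httpd/logs/access_log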
That's it. Everything goes quiet. On the node's console during PXE boot I observe iPXE net0 being configured with the correct MAC address; then it times out with the error "no more network devices available", or some such. The node then continues to boot from the hard disk and CentOS 6 shows up.
And when testing connectivity between the node and the SMS all is well … but the GET never happens, even though the iPXE config file is there and the correct NIC is responding. Weird. ASUS splash screen: "In search of the incredible". Indeed.
[root@n90 tmp]# telnet cottontail2 80
Trying 192.168.102.250...
Connected to cottontail2.
Escape character is '^]'.
GET /WW/file?hwaddr=04:d9:f5:bc:6e:c2&timestamp=0
# all files are retrieved
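For a quicker check than the telnet session, the same fetch can be reproduced with curl from the node (assuming curl is available in the image; the URL is simply the GET above):

curl -v 'http://cottontail2/WW/file?hwaddr=04:d9:f5:bc:6e:c2&timestamp=0' -o /tmp/ww-files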
First Idea: install the OHPC v1.3 CentOS 7 slurmd client on the node, then join that to the OHPC v2.4 slurmctld. To do that, first yum install
the ohpc-release from this location
Next do a 'yum install generic-package-name' of these packages to install the slurmd client of OHPC 1.3 for CentOS 7; a sketch of the commands follows the listing below.
-rw-r--r-- 1 root root    35264 Feb 23  2017 munge-devel-ohpc-0.5.12-21.1.x86_64.rpm
-rw-r--r-- 1 root root    51432 Jun 16 14:40 munge-libs-ohpc-0.5.12-21.1.x86_64.rpm
-rw-r--r-- 1 root root   114060 Jun 16 14:40 munge-ohpc-0.5.12-21.1.x86_64.rpm
-rw-r--r-- 1 root root     3468 Jun 16 14:44 ohpc-filesystem-1.3-26.1.ohpc.1.3.6.noarch.rpm
-rw-r--r-- 1 root root     2396 Jun 16 14:40 ohpc-slurm-client-1.3.8-3.1.ohpc.1.3.8.x86_64.rpm
-rw-r--r-- 1 root root  4434196 Jun 16 14:46 pmix-ohpc-2.2.2-9.1.ohpc.1.3.7.x86_64.rpm
-rw-r--r-- 1 root root    17324 Jun 16 14:40 slurm-contribs-ohpc-18.08.8-4.1.ohpc.1.3.8.1.x86_64.rpm
-rw-r--r-- 1 root root   198028 Jun 16 14:40 slurm-example-configs-ohpc-18.08.8-4.1.ohpc.1.3.8.1.x86_64.rpm
-rw-r--r-- 1 root root 13375940 Jun 16 14:40 slurm-ohpc-18.08.8-4.1.ohpc.1.3.8.1.x86_64.rpm
-rw-r--r-- 1 root root   148980 Jun 16 14:40 slurm-pam_slurm-ohpc-18.08.8-4.1.ohpc.1.3.8.1.x86_64.rpm
-rw-r--r-- 1 root root   796280 Jun 16 14:44 slurm-perlapi-ohpc-18.08.8-4.1.ohpc.1.3.8.1.x86_64.rpm
-rw-r--r-- 1 root root   654104 Jun 16 14:40 slurm-slurmd-ohpc-18.08.8-4.1.ohpc.1.3.8.1.x86_64.rpm
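A minimal sketch of those two steps, assuming the ohpc-release RPM has already been downloaded from the location linked above (generic package names taken from the listing):

# install the downloaded OpenHPC 1.3 repo package on the CentOS 7 node
yum install ./ohpc-release*.rpm
# then pull in the slurmd client stack by generic name (expands to the RPMs above)
yum install ohpc-slurm-client munge-ohpc munge-libs-ohpc munge-devel-ohpc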
Make sure munge/unmunge work between 1.3 and 2.4 and that the date is in sync (else you get error #16), then start slurmd with the 2.4 config files in place. This works, but the 1.3 slurmd client fails to register. This appears to be because the Slurm versions are too far apart, 2018 vs 2020. Hmm, why is OHPC v2.4 running such an old Slurm version?
Had to comment these out for slurmd to start (which seems OK because they are slurmctld settings not used by the slurmd client … according to the Slurm list):
#SelectType=select/cons_tres
#SelectTypeParameters=CR_CPU_Memory
[root@cottontail2 ~]# munge -n -t 10 | ssh n90 unmunge
STATUS:          Success (0)
ENCODE_HOST:     cottontail2 (192.168.102.250)
ENCODE_TIME:     2022-06-17 09:35:08 -0400 (1655472908)
DECODE_TIME:     2022-06-17 09:35:07 -0400 (1655472907)
TTL:             10
CIPHER:          aes128 (4)
MAC:             sha256 (5)
ZIP:             none (0)
UID:             root (0)
GID:             root (0)
LENGTH:          0
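To actually watch the registration attempt fail, slurmd can be run in the foreground with verbose logging (a sketch; -D and -v are standard slurmd flags):

# on n90, with the v2.4 slurm.conf and munge key in place
systemctl start munge
slurmd -D -vvv     # stays in the foreground and prints the registration errors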
Too bad. Ok, we'll keep the munge packages and remove all other ohpc v1.3 packages.
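Something along these lines, keeping the three munge packages and dropping the rest (names taken from the listing above):

yum remove slurm-ohpc slurm-slurmd-ohpc slurm-contribs-ohpc slurm-example-configs-ohpc \
           slurm-pam_slurm-ohpc slurm-perlapi-ohpc ohpc-slurm-client pmix-ohpc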
Second Idea: download the Slurm source for the closest version just above what OHPC v2.4 ships. Compile Slurm 20.11.9 and see if it is accepted for registration by the OHPC v2.4 slurmctld running Slurm 20.11.8 …
export PATH=/share/apps/CENTOS7/openmpi/4.0.4/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS7/openmpi/4.0.4/lib:$LD_LIBRARY_PATH

[root@n90 ~]# which gcc mpicc
/usr/bin/gcc
/share/apps/CENTOS7/openmpi/4.0.4/bin/mpicc

./configure \
  --prefix=/usr/local/slurm-20.11.9 \
  --sysconfdir=/usr/local/slurm-20.11.9/etc \
  --with-nvml=/usr/local/cuda
make
make install

[root@n90 slurm-20.11.9]# find /usr/local/slurm-20.11.9 -name auth_munge.so
/usr/local/slurm-20.11.9/lib/slurm/auth_munge.so
#
YES! it does register, hurray.
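From the controller side, the registration and the reported slurmd version can be checked with scontrol (a sketch):

[root@cottontail2 ~]# scontrol show node n90 | grep -E 'State|Version'
# the Version field should now read 20.11.9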
Finish with edits to /etc/rc.local and /etc/bashrc on the node. Do NOT mount /opt/intel and /opt/ohpc/pub from the SMS; that's all Rocky 8.5 stuff.
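A sketch of what those finishing touches might look like, assuming the install prefix from the build above (the original does not show the exact contents):

# appended to /etc/rc.local on n90: start munge and the locally compiled slurmd at boot
systemctl start munge
/usr/local/slurm-20.11.9/sbin/slurmd

# appended to /etc/bashrc: put the local Slurm client tools on the PATH
export PATH=/usr/local/slurm-20.11.9/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/slurm-20.11.9/lib:$LD_LIBRARY_PATH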
There is a warning on the Slurm web page regarding the older versions archive:
“Due to a security vulnerability (CVE-2022-29500), all versions of Slurm prior to 21.08.8 or 20.11.9 are no longer available for download” … so again, why is OpenHPC v2.4 running such an old Slurm version?
Third Idea: once we're fully deployed I may go to the latest Slurm version, run it on different ports, maybe with a newer munge version (although that should not matter; why does this scheduler even need munge?).
Run Slurm outside of OpenHPC via a local compile in /usr/local/: a standalone install of the most recent version.
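For the different-ports part, the relevant slurm.conf settings would be SlurmctldPort and SlurmdPort (the stock defaults are 6817/6818; the values below are just placeholders):

# slurm.conf for the standalone scheduler, offset from the stock ports
SlurmctldPort=7817
SlurmdPort=7818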
Decided to go this route: a v22.05.02 standalone version (with the ohpc v1.3 or v2.4 munge packages). You need all three packages (munge, munge-libs and munge-devel) on the host where you compile Slurm (note: cottontail2 for Rocky 8.5, node n90 for CentOS 7). Then just copy.
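A hedged sketch of that standalone build and copy, following the same configure recipe used for 20.11.9 above (tarball name and rsync target are assumptions):

tar xjf slurm-22.05.2.tar.bz2 && cd slurm-22.05.2
./configure --prefix=/usr/local/slurm-22.05.2 \
            --sysconfdir=/usr/local/slurm-22.05.2/etc
make && make install
# then just copy the finished prefix to the compute nodes
rsync -a /usr/local/slurm-22.05.2 n90:/usr/local/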
— Henk 2022/06/23 19:07
I use a script to make sure I do not miss any steps when imaging. It works like a charm, except with this ASUS hardware. The script will do stateless, stateful or golden image provisioning. For golden image creation, follow this Warewulf Golden Image link.
#!/bin/bash

# FIX vnfs & bootstrap for appropriate node
# formats 1t /dev/nvme0n1 !!!
# deploy a chroot server via PXE golden image transfer
# templates are always in stateless CHROOT/rocky8.5/root/wwtemplates
# look at header deploy.txt

node=$1
hwaddr0=$2
ipaddr0=$3
hwaddr1=$4
ipaddr1=$5

if [ $# != 5 ]; then
  echo "missing args: node hwaddr0 ipaddr0 hwaddr1 ipaddr1"
  exit
fi

wwsh object delete $node -y
sleep 3

wwsh node new $node --netdev=eth0 \
  --hwaddr=$hwaddr0 --ipaddr=$ipaddr0 \
  --netmask=255.255.0.0 --network=255.255.0.0 -y

wwsh node set $node --netdev=eth1 \
  --hwaddr=$hwaddr1 --ipaddr=$ipaddr1 \
  --netmask=255.255.0.0 --network=255.255.0.0 -y

wwsh provision set $node --fileadd hosts,munge.key -y
wwsh provision set $node --fileadd passwd,shadow,group -y
wwsh provision set $node --fileadd network.ww,ifcfg-eth1.ww -y

# PRESHELL & POSTSHELL 1=enable, 0=disable
#wwsh provision set $node --postshell=1 -y
#wwsh provision set $node --kargs="net.ifnames=0,biosdevname=0" -y
#wwsh provision set --postnetdown=1 $node -y

# stateless, comment out for golden image
# wwsh provision set $node --bootstrap=4.18.0-348.12.2.el8_5.x86_64 -y
# wwsh provision set $node --vnfs=rocky8.5 -y

# stateful, comment out for golden image and stateless
# install grub2 in $CHROOT first, rebuild vnfs
# wwsh provision set --filesystem=efi-n90 $node -y
# wwsh provision set --bootloader=nvme0n1 $node -y

# uncomment for golden image, comment out stateless and stateful
wwsh provision set $node --bootstrap=4.18.0-348.12.2.el8_5.x86_64 -y
wwsh provision set $node --vnfs=n101.chroot -y
wwsh provision set --filesystem=efi-n90 $node -y
wwsh provision set --bootloader=nvme0n1 $node -y
wwsh provision set --bootlocal=UNDEF $node -y

echo "for stateful or golden image, after first boot issue"
echo "wwsh provision set --bootlocal=normal $node -y"

wwsh pxe update
wwsh dhcp update

systemctl restart dhcpd
systemctl restart httpd
systemctl restart tftp.socket
# crontab will shutdown these services at 5pm
# formats 1T /dev/nvme0n1 !!!
n90 04:D9:F5:BC:6E:C2 192.168.102.100 04:D9:F5:BC:6E:C3 10.10.102.100
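With the values from that header line, invoking the script looks like this (the script name deploy.sh is assumed):

./deploy.sh n90 04:D9:F5:BC:6E:C2 192.168.102.100 04:D9:F5:BC:6E:C3 10.10.102.100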