Differences

This shows you the differences between two versions of the page.

--- cluster:217 [2022/06/20 09:48]
hmeij07 created
+++ cluster:217 [2022/06/24 09:53] (current)
hmeij07 [Slurm #3]
@@ Line 4: / Line 4: @@
 ==== Slurm entangles ====
-So, vaguely I remember when redoing my K20 gpu nodes I had troubles with that ASUS hardware and Warewulf 3.6. Now I have deployed a production cluster using OpenHPC 2.4, Rocky 8.5 and Warewulf 3.9 version. Same deal. Do not know what is going on but just documenting.
+So, vaguely I remember when redoing our K20 gpu nodes I had troubles with that ASUS hardware and Warewulf 3.6. Now I have deployed a production cluster using OpenHPC 2.4, Rocky 8.5 and Warewulf 3.9 version. Same deal. Do not know what is going on but just documenting.
 That's too bad as I was hoping to have a single operating system cluster. But now I will have to think about what to do with our CentOS 7 hardware which is running the old scheduler. The hope was to migrate everything to Slurm scheduler.
@@ Line 17: / Line 17: @@
   - Advanced > Network Stack – Enable
   - within that tab enable PXE4 and PXE6 support
-  - Boot -> boot order, network first then hard drive
+  - Boot > Boot order; network first then hard drive
   - Save & Exit > Yes & Reset
-Next we create the warewulf object and boot (see deploy script, at bottom).
+Next we create the warewulf node object and boot (see deploy script, at bottom).
-When this ASUS hardware boots, it sends over the correct mac address. We observer....
+When this ASUS hardware boots, it sends over the correct mac address. We observe....
 <code>
@@ Line 33: / Line 33: @@
 # in /etc/httpd/logs/access_log
-Jun  10 09:13:57 cottontail2 in.tftpd[388239]: Client ::ffff:192.168.102.100 finished /warewulf/ipxe/bin-i386-pcbios/undionly.kpxe
+Jun  10 09:13:57 cottontail2 in.tftpd[388239]: Client ::ffff:192.168.102.100 \
+finished /warewulf/ipxe/bin-i386-pcbios/undionly.kpxe
 </code>
-That's it. Everything goes quiet. On the node's console during pxe boot I observe ipxe net0 being configured with correct mac address, then it times out with the error "no more network devices available", or some such. Then the node continues to boot of hard disk and CentOS 6 shows up.
+That's it. Everything goes quiet. On the node's console during pxe boot I observe ipxe net0 being configured with the correct mac address, then it times out with the error "no more network devices available", or some such. Then the node continues to boot off the hard disk and CentOS 6 shows up.
 And when testing connectivity between node and SMS all is well ... but the GET never happens, the ipxe config file is there, the correct nic is responding, weird. ASUS splash screen: "In search of the incredible". Indeed.
@@ Line 58: / Line 59: @@
 ==== Slurm #1 ====
-First thought was to install OHPC v1.3 CentOS7 slurmd client on th enode, then join that to OHPC v2.4 slurmctld. To do that first ''yum install'' the ohpc-release from this location
+First Idea:  install OHPC v1.3 CentOS7 slurmd client on the node, then join that to OHPC v2.4 slurmctld. To do that first ''yum install'' the ohpc-release from this location
   * http://repos.openhpc.community/ohpc-1.3/1.3.9/base/CentOS_7/x86_64/ohpc-release-1.3-1.el7.x86_64.rpm
@@ Line 81: / Line 82: @@
 </code>
+Make sure munge/unmunge work between 1.3/2.4, that date is in sync (else you get error #16), and start up slurmd with 2.4 config files in place. This works but slurmd client of 1.3 fails to register. This appears to be an error in that the slurm versions are too far apart, 2018 vs 2020. Hmm, why is ohpc v2.4 running such an old slurm version?
+Had to comment out the following for slurmd to start (seems ok because they are slurmctld settings not used by the slurmd client, according to the slurm list)
-Make sure munge/unmunge work bewteen 1.3/2.4, date is in sync (else you get error (16), and startup slurmd with 2.4 config files in place. This works but slurmd client of 1.3 fails to register. This appears to be an error in that the slurem versions are to far apart, 2018 vs 2020. Hmm, why is ophc v2.4 running such an old slurm version?
+<code>
+#SelectType=select/cons_tres
+#SelectTypeParameters=CR_CPU_Memory
+</code>
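A minimal sketch of the two pre-checks mentioned above (munge round trip and clock sync). The function name ''clock_skew_ok'' and the 5 second default threshold are my own assumptions for illustration, not values from munge or slurm; the ''munge -n | ssh ... unmunge'' round trip is shown only as a comment because it needs both hosts.

```shell
#!/bin/bash
# Hypothetical helper: succeeds when two epoch timestamps differ by at most
# MAX seconds (default 5, an arbitrary assumption; tune to taste).
# usage: clock_skew_ok EPOCH_A EPOCH_B [MAX]
clock_skew_ok() {
    local a="$1" b="$2" max="${3:-5}"
    local diff=$(( a - b ))
    # take the absolute value of the difference
    [ "$diff" -lt 0 ] && diff=$(( -diff ))
    [ "$diff" -le "$max" ]
}

# Intended use from the SMS, with n90 as the node (not executed here):
#   munge -n | ssh n90 unmunge      # credential made here must decode on the node
#   clock_skew_ok "$(date +%s)" "$(ssh n90 date +%s)" || echo "clocks differ"
```

If the skew check fails, fix time sync first before looking at slurmd itself.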
 <code>
@@ Line 106: / Line 114: @@
 ==== Slurm #2 ====
-I downloaded Slurm source the closest version just above ohpc v2.4 version. Next compile 20.11.9 slurm and see if it is accepted on ohpc v2.4 slurm 20.11.8
+Second Idea: download the Slurm source for the closest version just above the ohpc v2.4 version. Next compile slurm 20.11.9 and see if it is accepted by ohpc v2.4 slurm 20.11.8 to register ....
 <code>
@@ Line 116: / Line 124: @@
 /share/apps/CENTOS7/openmpi/4.0.4/bin/mpicc
-./configure --prefix=/usr/local/slurm-20.11.9 \
+./configure \
+--prefix=/usr/local/slurm-20.11.9 \
 --sysconfdir=/usr/local/slurm-20.11.9/etc \
 --with-nvml=/usr/local/cuda
+make
+make install
 [root@n90 slurm-20.11.9]# find /usr/local/slurm-20.11.9 -name auth_munge.so
 /usr/local/slurm-20.11.9/lib/slurm/auth_munge.so
-# make the generic /usr/local/slurm -> /usr/local/slurm-20.11.9 link
+#
 </code>
@@ Line 130: / Line 140: @@
 YES! it does register, hurray.
-Finish with startup at boot in ''/etc/rc.local'' and exports envs in ''/etc/bashrc''
+Finish with
+  * make generic /usr/local/slurm -> /usr/local/slurm-20.11.9 link
+  * copy over munge.key, restart munge
+  * startup at boot in ''/etc/rc.local''
+  * export envs in ''/etc/bashrc''
+  * make dirs /var/log/slurm /var/spool/slurm
 Do **NOT** mount ''/opt/intel'' and ''/opt/ohpc/pub'' from SMS, that's all Rocky8.5 stuff.
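The filesystem part of the "Finish with" steps could be scripted roughly like this. It is a sketch under assumptions: the function name ''finish_slurm_install'' and the optional root prefix (for a dry run in a scratch directory) are made up here, and the munge key copy and boot-time startup are left as comments since they depend on the local setup.

```shell
#!/bin/bash
# Hypothetical helper covering the link and directory steps only.
# usage: finish_slurm_install [ROOT] [VERSION]
#   ROOT    optional prefix for a dry run (empty means the real filesystem)
#   VERSION slurm version of the local build (default 20.11.9)
finish_slurm_install() {
    local root="${1:-}"
    local version="${2:-20.11.9}"

    # generic link: /usr/local/slurm -> /usr/local/slurm-<version>
    mkdir -p "$root/usr/local"
    ln -sfn "/usr/local/slurm-$version" "$root/usr/local/slurm"

    # log and spool directories for slurmd
    mkdir -p "$root/var/log/slurm" "$root/var/spool/slurm"

    # Still to do by hand: copy over munge.key and restart munge,
    # add the slurmd startup to /etc/rc.local, export envs in /etc/bashrc.
}
```

Running it against a scratch directory first, for example ''finish_slurm_install /tmp/dryrun'', shows what it would create without touching the node.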
 ==== Slurm #3 ====
-Just a mental note. There is a warning on Slurm weeb page re the older versions archives page
+There is a warning on the Slurm web page regarding the archive of older versions
   * https://www.schedmd.com/archives.php
-Due to a security vulnerability (CVE-2022-29500), all versions of Slurm prior to 21.08.8 or 20.11.9 are no longer available for download...so why is openhpc v2.4 running such an old version?
+"Due to a security vulnerability (CVE-2022-29500), all versions of Slurm prior to 21.08.8 or 20.11.9 are no longer available for download"  ... so why is openhpc v2.4 running such an old slurm version?
-Once we're fully deployed I may go to latest Slurm version, run on different ports with maybe newer munger version (although that should not matter, why does this scheduler even need munge?) and run Slurm outside of openhpc. A standalone version.
+Third Idea: once we're fully deployed I may go to latest Slurm version, run on different ports with maybe newer munge version (although that should not matter, why does this scheduler even need munge?)
+Run Slurm outside of openhpc via local compile in ''/usr/local/''. A standalone version, the most recent version.
+Decided to go this route: a standalone v22.05.02 build (with the ohpc v1.3 or v2.4 munge packages). You need all three packages (munge, munge-libs and munge-devel) on the host where you compile slurm (note: cottontail2 for rocky8.5, node n90 for centos7). Then just copy.
+ --- //[[hmeij@wesleyan.edu|Henk]] 2022/06/23 19:07//
+==== Deploy ====
+I use a script to make sure I do not miss any steps when imaging. Works like a charm, except with ASUS hardware. This script will do stateless, stateful or golden image. For golden image creation follow this [[cluster:171|Warewulf Golden Image]] link.
+<code>
+#!/bin/bash
+# FIX vnfs & bootstrap for appropriate node
+# formats 1t /dev/nvme0n1 !!!
+# deploy a chroot server via PXE golden image transfer
+# templates are always in stateless CHROOT/rocky8.5/root/wwtemplates
+# look at header deploy.txt
+node=$1
+hwaddr0=$2
+ipaddr0=$3
+hwaddr1=$4
+ipaddr1=$5
+if [ $# -ne 5 ]; then
+        echo "missing args: node hwaddr0 ipaddr0 hwaddr1 ipaddr1"
+        exit 1
+fi
+wwsh object delete $node -y
+sleep 3
+wwsh node new $node --netdev=eth0 \
+--hwaddr=$hwaddr0 --ipaddr=$ipaddr0 \
+--netmask=255.255.0.0  --network=255.255.0.0 -y
+wwsh node set $node --netdev=eth1 \
+--hwaddr=$hwaddr1 --ipaddr=$ipaddr1 \
+--netmask=255.255.0.0  --network=255.255.0.0 -y
+wwsh provision set $node --fileadd hosts,munge.key -y
+wwsh provision set $node --fileadd passwd,shadow,group -y
+wwsh provision set $node --fileadd network.ww,ifcfg-eth1.ww -y
+# PRESHELL & POSTSHELL 1=enable, 0=disable
+#wwsh provision set $node --postshell=1 -y
+#wwsh provision set $node --kargs="net.ifnames=0,biosdevname=0" -y
+#wwsh provision set --postnetdown=1 $node -y
+# stateless, comment out for golden image
+# wwsh provision set $node --bootstrap=4.18.0-348.12.2.el8_5.x86_64 -y
+# wwsh provision set $node --vnfs=rocky8.5 -y
+# stateful, comment out for golden image and stateless
+# install grub2 in $CHROOT first, rebuild vnfs
+# wwsh provision set --filesystem=efi-n90  $node -y
+# wwsh provision set --bootloader=nvme0n1  $node -y
+# uncomment for golden image, comment out stateless and stateful
+ wwsh provision set $node --bootstrap=4.18.0-348.12.2.el8_5.x86_64 -y
+ wwsh provision set $node --vnfs=n101.chroot -y
+ wwsh provision set --filesystem=efi-n90  $node -y
+ wwsh provision set --bootloader=nvme0n1  $node -y
+wwsh provision set --bootlocal=UNDEF $node -y
+echo "for stateful or golden image, after first boot issue"
+echo "wwsh provision set --bootlocal=normal $node -y"
+wwsh pxe update
+wwsh dhcp update
+systemctl restart dhcpd
+systemctl restart httpd
+systemctl restart tftp.socket
+# crontab will shutdown these services at 5pm
+</code>
+  * n90.deploy.txt
+<code>
+# formats 1T /dev/nvme0n1 !!!
+n90 04:D9:F5:BC:6E:C2 192.168.102.100 04:D9:F5:BC:6E:C3 10.10.102.100
+</code>
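Since the deploy script formats a disk, it may be worth sanity checking the five fields from the deploy.txt line before running it. A minimal sketch, assuming a helper called ''check_deploy_args'' (my name, not part of the script above); it checks the format of the mac and ip fields only, not whether the ip octets are in range.

```shell
#!/bin/bash
# Hypothetical pre-flight check for the five deploy arguments.
# usage: check_deploy_args NODE HWADDR0 IPADDR0 HWADDR1 IPADDR1
check_deploy_args() {
    local mac_re='^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'
    local ip_re='^([0-9]{1,3}\.){3}[0-9]{1,3}$'

    if [ $# -ne 5 ]; then
        echo "missing args: node hwaddr0 ipaddr0 hwaddr1 ipaddr1" >&2
        return 1
    fi
    # hwaddr0 and hwaddr1 must look like six colon separated hex pairs
    for mac in "$2" "$4"; do
        printf '%s\n' "$mac" | grep -Eq "$mac_re" || { echo "bad hwaddr: $mac" >&2; return 1; }
    done
    # ipaddr0 and ipaddr1 must look like dotted quads (format only)
    for ip in "$3" "$5"; do
        printf '%s\n' "$ip" | grep -Eq "$ip_re" || { echo "bad ipaddr: $ip" >&2; return 1; }
    done
    return 0
}
```

One way to use it: pass the non-comment line of n90.deploy.txt as arguments, for example ''check_deploy_args $(grep -v '^#' n90.deploy.txt)'', and only call the deploy script when it returns success.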
