==== Slurm entangles ====
So, vaguely I remember reading when redoing the cluster that OpenHPC v2.x only supports EL8 operating systems (Rocky 8), not CentOS 7. That's too bad as I was hoping to have a single operating system cluster. But now I will have to think about what to do with our CentOS 7 hardware, which is running the old scheduler. The hope was to migrate everything to the Slurm scheduler.
  - Advanced > Network Stack – Enable
  - within that tab enable PXE4 and PXE6 support
  - Boot > Boot order; network first then hard drive
  - Save & Exit > Yes & Reset
Next we create the warewulf object and boot (see deploy script at bottom).

When this ASUS hardware boots, it sends over the correct mac address. We observe....
<code>
# in /
Jun 10 09:13:57 cottontail2 in.tftpd[388239]:
finished /
</code>
That's it. Everything goes quiet. On the node's console during pxe boot I observe ipxe net0 being configured with the correct mac address, then it times out with the error "no more network devices available".

And when testing connectivity between node and SMS all is well ... but the GET never happens, the ipxe config file is there, the correct nic is responding. Weird. ASUS splash screen: "In search of the incredible"
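One way to see whether the node's requests ever reach the SMS is to watch the provisioning ports while it pxe boots (a generic sketch; the interface name and node IP are placeholders, not from these notes):

<code>
# on the SMS; eth0 and 192.168.102.90 are placeholder values
tcpdump -ni eth0 host 192.168.102.90 and \( port 69 or port 80 \)
</code>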
==== Slurm #1 ====
First Idea: install the OHPC v1.3 CentOS7 ''slurmd'' client on the node, then join that to the OHPC v2.4 ''slurmctld''. To do that, first:

  * http://
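A sketch of what that client install could look like, assuming the OHPC v1.3 repo above is already configured on the node (package names follow the OHPC convention; not copied from the original):

<code>
# on the centos7 node
yum install slurm-slurmd-ohpc munge-ohpc
systemctl enable slurmd
</code>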
Make sure munge/unmunge work between 1.3/2.4 and the date is in sync (else you get error (16)), then start up slurmd with the 2.4 config files in place. This works, but the 1.3 slurmd client fails to register. This appears to be because the Slurm versions are too far apart, 2018 vs 2020. Hmm, why is OHPC v2.4 running such an old Slurm version?

Had to uncomment this for slurmd to start (seems ok because they are slurmctld settings not used by the slurmd client... according to the slurm list)

<code>
#
#
</code>
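For the munge/unmunge check mentioned above, the standard test is to encode a credential on one host and decode it on the other (hostnames from this page's context):

<code>
# on the centos7 node, encode locally, decode on the rocky8.5 head node
munge -n | ssh cottontail2 unmunge
# clocks must agree closely or unmunge rejects the credential
date; ssh cottontail2 date
</code>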
==== Slurm #2 ====
Second Idea: download the slurm-20.11.9 source (matching what OHPC v2.4 runs) and compile it locally on the CentOS7 node.
<code>
/
./configure \
  --prefix=/ \
  --sysconfdir=/ \
  --with-nvml=/
make
make install

[root@n90 slurm-20.11.9]#
/
# make the generic /
</code>
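Before pointing the fresh build at the controller, it can be sanity-checked locally; ''slurmd -C'' prints the hardware the daemon detects (the install prefix below is a placeholder since the real path is truncated above):

<code>
# print detected node configuration; prefix is a placeholder
/usr/local/slurm-20.11.9/sbin/slurmd -C
</code>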
YES! it does register, hurray.
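To double-check the registration from the head node (standard Slurm commands):

<code>
# on cottontail2
sinfo -N -l
scontrol show node n90
</code>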
Finish with:

  * make generic /
  * copy over munge.key, restart munge
  * startup at boot in ''/
  * export
  * make dirs /

Do **NOT** mount ''/
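A minimal sketch of some of those finishing steps, assuming a /usr/local install prefix and rc.local startup (both assumptions; the real paths are truncated above):

<code>
# hypothetical finishing steps; adjust prefix and paths to the actual install
scp cottontail2:/etc/munge/munge.key /etc/munge/munge.key
systemctl restart munge
mkdir -p /var/spool/slurmd /var/log/slurm
echo '/usr/local/slurm-20.11.9/sbin/slurmd' >> /etc/rc.local
</code>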
==== Slurm #3 ====
There is a warning on the Slurm web page re the older versions archive page

  * https://

"Due to a security vulnerability (CVE-2022-29500), ..."
Third Idea: once we're fully deployed I may go to the latest Slurm version, run on different ports, with maybe a newer munge version (although that should not matter; why does this scheduler even need munge?).

Run Slurm outside of openhpc via a local compile in ''/

Decided to go this route: the v22.05.02 standalone version (with OHPC v1.3 or v2.4 munge packages). You need all three packages (munge, munge-libs and munge-devel) on the host where you compile Slurm (note: cottontail2 for rocky8.5, node n90 for centos7). Then just copy.
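For the different-ports idea, the standalone build's ''slurm.conf'' would override the defaults (6817/6818); a sketch with arbitrary values:

<code>
# hypothetical slurm.conf entries for the standalone v22.05.02 instance
SlurmctldPort=6827
SlurmdPort=6828
</code>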

 --- //

==== Deploy ====

I use a script to make sure I do not miss any steps when imaging. Works like a charm, except with the ASUS hardware.

<code>
#!/bin/bash

# FIX vnfs & bootstrap for appropriate node
# formats 1t /

# deploy a chroot server via PXE golden image transfer
# templates are always in stateless CHROOT/
# look at header deploy.txt

node=$1
hwaddr0=$2
ipaddr0=$3
hwaddr1=$4
ipaddr1=$5

if [ $# != 5 ]; then
  echo "usage: $0 node hwaddr0 ipaddr0 hwaddr1 ipaddr1"
  exit
fi

wwsh object delete $node -y
sleep 3

wwsh node new $node --netdev=eth0 \
     --hwaddr=$hwaddr0 --ipaddr=$ipaddr0 \
     --netmask=255.255.0.0

wwsh node set $node --netdev=eth1 \
     --hwaddr=$hwaddr1 --ipaddr=$ipaddr1 \
     --netmask=255.255.0.0

wwsh provision set $node --fileadd hosts,
wwsh provision set $node --fileadd passwd,
wwsh provision set $node --fileadd network.ww,

# PRESHELL & POSTSHELL 1=enable, 0=disable
#wwsh provision set $node --postshell=1 -y
#wwsh provision set $node --kargs="
#wwsh provision set --postnetdown=1 $node -y

# stateless, comment out for golden image
# wwsh provision set $node --bootstrap=4.18.0-348.12.2.el8_5.x86_64 -y
# wwsh provision set $node --vnfs=rocky8.5 -y

# stateful, comment out for golden image and stateless
# install grub2 in $CHROOT first, rebuild vnfs
# wwsh provision set --filesystem=efi-n90
# wwsh provision set --bootloader=nvme0n1

# uncomment for golden image, comment out stateless and stateful
wwsh provision set $node --bootstrap=4.18.0-348.12.2.el8_5.x86_64 -y
wwsh provision set $node --vnfs=n101.chroot -y
wwsh provision set --filesystem=efi-n90
wwsh provision set --bootloader=nvme0n1

wwsh provision set --bootlocal=UNDEF $node -y
echo "for stateful or golden image, after first boot issue"
echo "wwsh provision set --bootlocal=normal $node -y"

wwsh pxe update
wwsh dhcp update
systemctl restart dhcpd
systemctl restart httpd
systemctl restart tftp.socket
# crontab will shutdown these services at 5pm

</code>

  * n90.deploy.txt

<code>
# formats 1T /
n90 04:
</code>
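For reference, a hypothetical invocation of the deploy script (script name, mac addresses, and IPs are placeholders; the real values are truncated in the note above):

<code>
# placeholder values; use the node's real mac addresses and IPs
./deploy.sh n90 04:xx:xx:xx:xx:aa 192.168.102.90 04:xx:xx:xx:xx:bb 10.10.102.90
</code>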