OK, so we have a data center power outage for some electrical maintenance work on Sunday 8/26, 2 AM to 9 AM.
How do we shut down the cluster? Here are the steps I took.
badmin qinact -C "preparing for power outage 8/26 2-9 AM" all
bjobs -r -u all | awk '{print $1}' | grep -v JOBID | fmt
Requeue the running jobs with the brequeue command:
brequeue -u all [list of JOBIDs]
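The listing and requeue steps can be chained so the job IDs do not have to be pasted by hand. The following is a minimal sketch, not the procedure actually used here: the function names `extract_job_ids` and `requeue_jobs` and the `DRY_RUN` variable are hypothetical, and by default the sketch only prints the `brequeue` command it would run.

```shell
#!/bin/bash
# Sketch only: helper names and DRY_RUN are illustrative assumptions.

# Read `bjobs`-style output on stdin and print the numeric job IDs,
# which skips the JOBID header and any wrapped continuation lines.
extract_job_ids() {
    awk '$1 ~ /^[0-9]+$/ {print $1}'
}

# Print the brequeue command for the given job IDs, or execute it
# when DRY_RUN=0 is set in the environment.
requeue_jobs() {
    if [ "$#" -eq 0 ]; then
        echo "no running jobs to requeue"
        return 0
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "brequeue -u all $*"
    else
        brequeue -u all "$@"
    fi
}
```

On the head node this would be called as `requeue_jobs $(bjobs -r -u all | extract_job_ids)`, and repeated with `DRY_RUN=0` once the printed command looks right.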
bqueues
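Rather than reading the `bqueues` table by eye, the status column can be filtered. This is a rough sketch that assumes the default short `bqueues` layout, where the third column is STATUS and inactivated queues show a value such as `Open:Inact`; the function name `active_queues` is hypothetical.

```shell
#!/bin/bash
# Sketch only: assumes the third column of `bqueues` output is STATUS
# and that inactivated queues contain ":Inact" in that column.

# Read `bqueues`-style output on stdin and print the names of queues
# that are NOT yet inactive (header line skipped).
active_queues() {
    awk 'NR > 1 && $3 !~ /:Inact/ {print $1}'
}
```

Running `bqueues | active_queues` on the head node should print nothing once every queue has been inactivated.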
cluster-fork uptime
cluster-fork halt
/root/ipmi_nodes
and supply argument 'on', 'off' or 'status'.
#!/bin/bash
# all compute nodes on IPMI subnet
for i in `seq 218 253`
do
    # CAREFUL, OPTIONS ARE: status, on or off
    echo 192.168.2.${i}
    ipmitool -H 192.168.2.${i} -U XXXXXX -P YYYYYY chassis power off
done
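The script as listed hardcodes the power action, so it has to be edited before each use. As a hedged alternative, here is a sketch of a variant that takes the action as its first argument and rejects anything other than `on`, `off` or `status`. The `DRY_RUN`, `FIRST_HOST` and `LAST_HOST` variables and the function names are assumptions added for illustration; the address range and placeholder credentials are taken from the script above.

```shell
#!/bin/bash
# Sketch only: DRY_RUN, FIRST_HOST, LAST_HOST and the function names
# are illustrative assumptions, not part of the original script.

# Accept only the three chassis power actions used in this procedure.
valid_power_action() {
    case "$1" in
        on|off|status) return 0 ;;
        *) return 1 ;;
    esac
}

# Loop over the compute nodes on the IPMI subnet. By default the
# ipmitool commands are only printed; set DRY_RUN=0 to execute them.
ipmi_nodes() {
    local action="$1"
    if ! valid_power_action "$action"; then
        echo "usage: ipmi_nodes on|off|status" >&2
        return 1
    fi
    local i host
    for i in $(seq "${FIRST_HOST:-218}" "${LAST_HOST:-253}"); do
        host="192.168.2.${i}"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ipmitool -H ${host} -U XXXXXX -P YYYYYY chassis power ${action}"
        else
            echo "${host}"
            ipmitool -H "${host}" -U XXXXXX -P YYYYYY chassis power "${action}"
        fi
    done
}
```

With this variant, `ipmi_nodes status` prints the commands for all 36 addresses, and `DRY_RUN=0 ipmi_nodes off` would actually power the nodes off.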
Run the halt command on the head node, then manually power it off.
Run the halt command on the ionode, then manually power it off.
Technically, the cluster still has power, but the filers providing the file systems do not, hence the power down. In this case the UPS, switches and MD1000s can stay powered up. In a complete power outage, turn all these devices off last.
lsid and lsload
Run the ipmi_nodes program with the proper argument.
cluster-fork cat /etc/motd >> /tmp/foo 2>&1
… this file should have the standard message of the day announcement, not the “Kick started on such and such a date-time” message.
cluster-fork rm -rf /localscratch/[0-9]*
and rm -rf /sanscratch/[0-9]*
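For the message-of-the-day check above, the collected file can be scanned instead of read by eye. This is a small sketch that assumes the kickstart message contains the literal words "Kick started" (the exact wording on the nodes may differ); the function name `count_kickstart_motd` is hypothetical.

```shell
#!/bin/bash
# Sketch only: assumes the unwanted motd contains "Kick started"
# (matched case-insensitively); adjust the pattern to the real text.

# Print how many lines of the collected motd file (default /tmp/foo)
# contain the kickstart marker. Always returns success so that a
# count of 0 is not treated as an error.
count_kickstart_motd() {
    local file="${1:-/tmp/foo}"
    grep -ci "kick started" "$file" || true
}
```

A result of `0` from `count_kickstart_motd /tmp/foo` suggests every node shows the standard announcement; any other count means some nodes still need attention.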
badmin qact all