Ok, so we have a data center power outage for some electrical maintenance work Sunday 8/26, 2am-9am.
How to shut down the cluster? Here are the steps I took.
Cluster Power Down
- #1
- Turn all queues to inactive 24 hours before shutdown.
- command:
badmin qinact -C "preparing for power outage 8/26 2-9 AM" all
- All running jobs remain running.
- Any new jobs submitted go into PEND mode. (A quick check that the queues really went inactive is sketched below.)
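To verify the queues are actually deactivated, bqueues is enough. A minimal sketch, assuming the default column layout where STATUS is the third column (and status strings like Open:Active / Open:Inact, which may vary by LSF version):
bqueues
# list only queues that are still active, i.e. not yet deactivated
bqueues | awk '$3 ~ /Active/ {print $1, $3}'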
- #2
- Hours before the actual shutdown, requeue all running jobs.
- command:
bjobs -r -u all | awk '{print $1}' | grep -v JOBID | fmt
- Then feed that list of running JOBIDs to the brequeue command.
- command:
brequeue -u all [list of JOBIDs]
- This moves the requeued jobs to the top of the PENDing jobs list. (A one-liner that glues the two commands together is sketched after this step.)
- Check that all jobs are in PEND mode and nothing is running.
- command:
bqueues
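If you don't want to paste the job ID list by hand, the two commands above can be glued together with command substitution. Just a sketch of the same pipeline and the same brequeue call; I did not run it in this form:
# requeue every running job in one shot
brequeue -u all $(bjobs -r -u all | awk '{print $1}' | grep -v JOBID)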
- #3
- We're ready to bring all compute nodes (but not the head & io node) down.
- command:
cluster-fork uptime
- command:
cluster-fork halt
- Check with the first command and with the console that the hosts are halted. (We'll shut the power off next, and if they are not properly halted, you're creating a lot of work for yourself upon reboot.)
- #4
- Power off the compute nodes.
- Edit the script /root/ipmi_nodes and supply the argument 'on', 'off' or 'status'. (A variant that takes the argument on the command line is sketched after the script.)
#!/bin/bash
# all compute nodes on IPMI subnet
for i in `seq 218 253`
do
  # CAREFUL, OPTIONS ARE: status, on or off
  echo 192.168.2.${i}
  ipmitool -H 192.168.2.${i} -U XXXXXX -P YYYYYY chassis power off
done
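A small variant, sketched here (this is not the script actually installed in /root), takes the action from the command line instead of editing the file each time; the IP range and the masked credentials are carried over from the original:
#!/bin/bash
# usage: ipmi_nodes status|on|off
ACTION="$1"
case "$ACTION" in
  status|on|off) ;;
  *) echo "usage: $0 status|on|off" >&2; exit 1 ;;
esac
# all compute nodes on IPMI subnet
for i in `seq 218 253`
do
  echo 192.168.2.${i}
  ipmitool -H 192.168.2.${i} -U XXXXXX -P YYYYYY chassis power ${ACTION}
done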
- #5
- Issue the halt command on the head node, then manually power it off.
- Issue the halt command on the ionode, then manually power it off.
- (Arghh, the ionode went down with IO errors ... )
- #6
- Turn off power to UPS. (not done)
- Turn off power to switches. (not done)
- Turn off power to MD1000 disk arrays. (done)
Technically, the cluster still has power but the filers providing the file systems do not. Hence the power down. But in this case UPS, switches and MD1000s can stay powered up. In a complete power outage, turn all these devices off last.
Cluster Power Up
- #1
- Turn on power to UPS.
- Turn on power to switches.
- Turn on power to MD1000 disk arrays.
- #2
- Turn on the ionode. (Check fiber & nfs mounts/exports; a quick check is sketched below.)
- Turn on the head node. (Check the scheduler with lsid & lsload.)
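A quick way to eyeball the exports and mounts once the ionode is back up. Only a sketch; the exact mount points aren't listed here and showmount/exportfs/df are my choice of tools, not commands from the original notes:
# on the ionode: are the fiber-attached file systems mounted and exported?
df -h
exportfs -v
# on the head node: did the NFS mounts from the ionode come back?
mount -t nfs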
- #3
- Turn on the compute nodes with the ipmi_nodes program and the proper argument.
- Check that all compute nodes came back up without being re-imaged. You can do this with the command
cluster-fork cat /etc/motd >> /tmp/foo 2>&1
... the file should contain the standard message of the day announcement, not the "Kick started on such and such a date-time" message. (A grep for that is sketched below.)
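To avoid reading the whole file by eye, something like the following should flag any re-imaged node. A sketch; the exact wording and spelling of the kickstart banner in /etc/motd is an assumption:
# start with a fresh file, then collect every node's motd
rm -f /tmp/foo
cluster-fork cat /etc/motd >> /tmp/foo 2>&1
# any hit here means a node was kickstarted (re-imaged) instead of simply rebooting
grep -iE "kick ?started" /tmp/foo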
- #4
- This is a good time to clean up /sanscratch and all the /localscratch dirs ... on the head node do
cluster-fork rm -rf /localscratch/[0-9]*
followed by
rm -rf /sanscratch/[0-9]*
- Reactivate the queues (badmin qact all).
- Check that all possible jobs are restarted. (A quick count per job state is sketched below.)
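To see whether jobs actually resumed after reactivating the queues, count jobs per state; a minimal sketch, assuming the default bjobs output where STAT is the third column:
# count jobs per state (RUN, PEND, ...) across all users, skipping the header line
bjobs -u all | awk 'NR > 1 {print $3}' | sort | uniq -c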
- #5
- Double check that Tivoli backups are scheduled to run.
- Inform user base.
- Go home and sleep.
