Ok, so we have a data center power outage for some electrical maintenance work Sunday 8/26, 2am-9am.
How to shut down the cluster? Here are the steps I took.
Cluster Power Down
- #1
- Turn all queues to inactive 24 hours before shutdown.
- command:
badmin qinact -C "preparing for power outage 8/26 2-9 AM" all
- All running jobs remain running.
- Any new jobs submitted go into PEND mode. (A quick check that the queues really went inactive is sketched below.)
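To verify the queues are actually deactivated, bqueues is enough. A minimal sketch, assuming the default column layout where STATUS is the third column (and status strings like Open:Active / Open:Inact, which may vary by LSF version):
bqueues
# list only queues that are still active, i.e. not yet deactivated
bqueues | awk '$3 ~ /Active/ {print $1, $3}'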
- #2
- Hours before the actual shutdown, requeue all running jobs.
- command:
bjobs -r -u all | awk '{print $1}' | grep -v JOBID | fmt
- Then feed that list of running JOBIDs to the brequeue command.
- command:
brequeue -u all [list of JOBIDs]
- This moves the requeued jobs to the top of the PENDing jobs list. (A one-liner that glues the two commands together is sketched after this step.)
- Check that all jobs are in PEND mode and nothing is running.
- command:
bqueues
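If you don't want to paste the job ID list by hand, the two commands above can be glued together with command substitution. Just a sketch of the same pipeline and the same brequeue call; I did not run it in this form:
# requeue every running job in one shot
brequeue -u all $(bjobs -r -u all | awk '{print $1}' | grep -v JOBID)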
- #3
- We're ready to bring all compute nodes (but not the head & io node) down.
- command:
cluster-fork uptime
- command:
cluster-fork halt
- Check with the first command and with the console that the hosts are halted. (We'll shut the power off next, and if they are not properly halted, you're creating a lot of work for yourself upon reboot.)
- #4
- Power off the compute nodes.
- Edit the script /root/ipmi_nodes and supply the argument 'on', 'off' or 'status'. (A variant that takes the argument on the command line is sketched after the script.)
#!/bin/bash
# all compute nodes on IPMI subnet
for i in `seq 218 253`
do
  # CAREFUL, OPTIONS ARE: status, on or off
  echo 192.168.2.${i}
  ipmitool -H 192.168.2.${i} -U XXXXXX -P YYYYYY chassis power off
done
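A small variant, sketched here (this is not the script actually installed in /root), takes the action from the command line instead of editing the file each time; the IP range and the masked credentials are carried over from the original:
#!/bin/bash
# usage: ipmi_nodes status|on|off
ACTION="$1"
case "$ACTION" in
  status|on|off) ;;
  *) echo "usage: $0 status|on|off" >&2; exit 1 ;;
esac
# all compute nodes on IPMI subnet
for i in `seq 218 253`
do
  echo 192.168.2.${i}
  ipmitool -H 192.168.2.${i} -U XXXXXX -P YYYYYY chassis power ${ACTION}
done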
- #5
- Issue the halt command on the head node, then manually power it off.
- Issue the halt command on the ionode, then manually power it off.
- (Arghh, the ionode went down with IO errors ... )
- #6
- Turn off power to UPS. (not done)
- Turn off power to switches. (not done)
- Turn off power to MD1000 disk arrays. (done)
Technically, the cluster still has power but the filers providing the file systems do not. Hence the power down. But in this case UPS, switches and MD1000s can stay powered up. In a complete power outage, turn all these devices off last.
Cluster Power Up
- #1
- Turn on power to UPS.
- Turn on power to switches.
- Turn on power to MD1000 disk arrays.
- #2
- Turn on the ionode. (Check fiber & nfs mounts/exports; a quick check is sketched below.)
- Turn on the head node. (Check the scheduler with lsid & lsload.)
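A quick way to eyeball the exports and mounts once the ionode is back up. Only a sketch; the exact mount points aren't listed here and showmount/exportfs/df are my choice of tools, not commands from the original notes:
# on the ionode: are the fiber-attached file systems mounted and exported?
df -h
exportfs -v
# on the head node: did the NFS mounts from the ionode come back?
mount -t nfs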
- #3
- Turn on the compute nodes with the ipmi_nodes program and the proper argument.
- Check that all compute nodes came back up without being re-imaged. You can do this with the command
cluster-fork cat /etc/motd >> /tmp/foo 2>&1
... the file should contain the standard message of the day announcement, not the "Kick started on such and such a date-time" message. (A grep for that is sketched below.)
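To avoid reading the whole file by eye, something like the following should flag any re-imaged node. A sketch; the exact wording and spelling of the kickstart banner in /etc/motd is an assumption:
# start with a fresh file, then collect every node's motd
rm -f /tmp/foo
cluster-fork cat /etc/motd >> /tmp/foo 2>&1
# any hit here means a node was kickstarted (re-imaged) instead of simply rebooting
grep -iE "kick ?started" /tmp/foo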
- #4
- This is a good time to clean up /sanscratch and all the /localscratch dirs ... on the head node do
cluster-fork rm -rf /localscratch/[0-9]*
followed by
rm -rf /sanscratch/[0-9]*
- Reactivate the queues (badmin qact all).
- Check that all possible jobs are restarted. (A quick count per job state is sketched below.)
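To see whether jobs actually resumed after reactivating the queues, count jobs per state; a minimal sketch, assuming the default bjobs output where STAT is the third column:
# count jobs per state (RUN, PEND, ...) across all users, skipping the header line
bjobs -u all | awk 'NR > 1 {print $3}' | sort | uniq -c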
- #5
- Double check that Tivoli backups are scheduled to run.
- Inform user base.
- Go home and sleep.
