Differences

This shows you the differences between two versions of the page.

--- cluster:45 [2007/08/27 13:58]
+++ cluster:45 [2007/08/27 13:58] (current)
@@ Line 1: / Line 1: @@
+\\
+**[[cluster:0|Back]]**
+Ok, so we have a data center power outage for some electrical maintenance work on Sunday 8/26, 2am-9am.
+How to shut down the cluster?  Here are the steps I took.
+===== Cluster Power Down =====
+  * #1
+    * Turn all queues to inactive 24 hours before shut down.
+    * command: ''badmin qinact -C "preparing for power outage 8/26 2-9 AM" all''
+    * All running jobs remain running.
+    * Any new jobs submitted go into PEND mode.
+  * #2
+    * Hours before actual shutdown, requeue all running jobs.
+    * command: ''bjobs -r -u all | awk '{print $1}' | grep -v JOBID | fmt''
+    * Then feed that list of running JOBIDs to the ''brequeue'' command.
+    * command: ''brequeue -u all [list of JOBIDs]''
+    * This moves the requeued jobs to the top of the PENDing jobs list.
+    * Check that all jobs are in PEND mode and nothing is running.
+    * command: ''bqueues''
+  * #3
+    * We're ready to bring all compute nodes (but not head & io node) down.
+    * command: ''cluster-fork uptime''
+    * command: ''cluster-fork halt''
+    * Check with the first command and on the console that the hosts are halted. (We'll shut the power off next, and if the hosts are not properly halted you're creating a lot of work for yourself upon reboot.)
+  * #4
+    * Power off the compute nodes.
+    * Edit the script ''/root/ipmi_nodes'' and set the last argument of the ''ipmitool'' line to 'on', 'off' or 'status'.
+<code>
+#!/bin/bash
+# all compute nodes on IPMI subnet
+for i in $(seq 218 253)
+do
+  # CAREFUL, OPTIONS ARE: status, on or off
+  echo "192.168.2.${i}"
+  ipmitool -H "192.168.2.${i}" -U XXXXXX -P YYYYYY chassis power off
+done
+</code>
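+Editing the script by hand right before a shutdown is easy to get wrong, so here is a minimal sketch of the same loop with the action passed as an argument.  The function name ''ipmi_nodes'' mirrors the script above; the environment variables ''IPMI_USER'', ''IPMI_PASS'' and ''DRY_RUN'' are assumptions made for this example and are not part of the original script.

```shell
#!/bin/bash
# Sketch only: same IPMI loop as above, with the action as an argument.
# IPMI_USER, IPMI_PASS and DRY_RUN are hypothetical environment variables.

ipmi_nodes() {
    local action="$1"

    # Refuse anything other than the three actions the script is meant for.
    case "$action" in
        status|on|off) ;;
        *)
            echo "usage: ipmi_nodes {status|on|off}" >&2
            return 2
            ;;
    esac

    local i host
    # all compute nodes on IPMI subnet
    for i in $(seq 218 253); do
        host="192.168.2.${i}"
        echo "$host"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Dry run: print the command (credentials masked) instead of running it.
            echo "ipmitool -H $host -U <user> -P <password> chassis power $action"
        else
            ipmitool -H "$host" -U "$IPMI_USER" -P "$IPMI_PASS" chassis power "$action"
        fi
    done
}
```

+A dry run such as ''DRY_RUN=1 ipmi_nodes status'' prints what would be sent to each node, which is a cheap sanity check before running ''ipmi_nodes off'' for real.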
+  * #5
+    * Issue ''halt'' command on head node.  __And manually power off.__
+    * Issue ''halt'' command on ionode. __And manually power off.__
+    * (Arghh, ionode went down with IO errors ... )
+  * #6
+    * Turn off power to UPS. (not done)
+    * Turn off power to switches. (not done)
+    * Turn off power to MD1000 disk arrays. (done)
+Technically, the cluster still has power but the filers providing the file systems do not.  Hence the power down.  But in this case UPS, switches and MD1000s can stay powered up. In a complete power outage, turn all these devices off last.
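+The requeue in step #2 can be wrapped up roughly as follows.  This is a sketch, not what was actually run: ''running_job_ids'' and ''requeue_running'' are hypothetical helper names.  The filter keeps only lines whose first field is purely numeric, on the assumption that ''bjobs'' may print header or continuation lines (for example extra execution hosts of a parallel job) that are not job IDs.

```shell
#!/bin/bash
# Sketch of step #2: collect running job IDs and hand them to brequeue.
# running_job_ids and requeue_running are hypothetical helper names.

# Read `bjobs` output on stdin and print one numeric job ID per line.
# Header lines and any line whose first field is not a number are skipped.
running_job_ids() {
    awk '$1 ~ /^[0-9]+$/ { print $1 }' | sort -un
}

# Requeue every running job; does nothing if no job is running.
requeue_running() {
    local ids
    ids=$(bjobs -r -u all 2>/dev/null | running_job_ids)
    if [ -z "$ids" ]; then
        echo "no running jobs to requeue"
        return 0
    fi
    # Word splitting of $ids is intentional: one argument per job ID.
    brequeue -u all $ids
}
```

+After ''requeue_running'', ''bqueues'' should show nothing running, same as in the manual procedure.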
+===== Cluster Power Up =====
+  * #1
+    * Turn on power to UPS.
+    * Turn on power to switches.
+    * Turn on power to MD1000 disk arrays.
+  * #2
+    * Turn on ionode. (check fiber & nfs mounts/exports).
+    * Turn on head node. (check scheduler with ''lsid'' & ''lsload'').
+  * #3
+    * Turn on compute nodes with ''ipmi_nodes'' program and proper argument.
+    * Check that all compute nodes came back up without being re-imaged.  You can do this with the command ''cluster-fork cat /etc/motd >> /tmp/foo 2>&1''; the resulting file should contain the standard message of the day announcement, __not__ the "Kick started on such and such a date-time" message.
+  * #4
+    * This is a **//good//** time to clean up /sanscratch and all /localscratch dirs.  On the head node, run ''cluster-fork rm -rf /localscratch/[0-9]*'' and ''rm -rf /sanscratch/[0-9]*''.
+    * Reactivate the queues (''badmin qact all'').
+    * Check that all possible jobs are restarted.
+  * #5
+    * Double check that Tivoli backups are scheduled to run.
+    * Inform user base.
+    * Go home and sleep.
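+Reading through /tmp/foo by eye works, but the re-image check in step #3 can also be filtered.  A rough sketch, under two assumptions that should be verified on your own cluster: that ''cluster-fork'' prints each node name on its own line ending in a colon before that node's output, and that a re-imaged node's motd contains the phrase "Kick started" (matched here case-insensitively, with or without the space).  ''reimaged_nodes'' is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: list nodes whose motd suggests they were re-imaged on boot.
# Assumes cluster-fork prints "<nodename>:" on its own line before each
# node's output; reimaged_nodes is a hypothetical helper name.

reimaged_nodes() {
    awk '
        # A single token ending in ":" is taken to be a node header line.
        /^[^ \t]+:[ \t]*$/ { host = $1; sub(/:$/, "", host); next }
        # Any motd line mentioning a kickstart flags the current node.
        tolower($0) ~ /kick ?started/ { print (host == "" ? "unknown" : host) }
    '
}
```

+Usage would be ''cluster-fork cat /etc/motd 2>&1 | reimaged_nodes''; empty output means no node reported a kickstart message.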
