cluster:45 [2007/08/27 13:58] |
cluster:45 [2007/08/27 13:58] (current) |
||
---|---|---|---|
+ | |||
+ | OK, so we have a data center power outage for electrical maintenance work on Sunday 8/26, 2am-9am. | ||
+ | |||
+ | How to shut down the cluster? | ||
+ | |||
+ | |||
+ | ===== Cluster Power Down ===== | ||
+ | |||
+ | * #1 | ||
+ | * Turn all queues to inactive 24 hours before shut down. | ||
+ | * command: '' | ||
+ | * All running jobs remain running. | ||
+ | * Any new jobs submitted go into PEND mode. | ||
+ | |||
+ | * #2 | ||
+ | * A few hours before the actual shutdown, requeue all running jobs. | ||
+ | * command: '' | ||
+ | * Then feed that list of running JOBPIDs to the '' | ||
+ | * command: '' | ||
+ | * This moves the requeued jobs to the top of the PENDing jobs list. | ||
+ | * Check that all jobs are in PEND mode and nothing is running. | ||
+ | * command: '' | ||
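
The commands for step #2 are not preserved on this page, so the following is only a sketch of the "collect the running job IDs" part. It assumes the scheduler is LSF (the PEND state and the queue terminology suggest it, but the page does not say so) and that the job listing uses the default `bjobs` column layout, with JOBID in the first column and STAT in the third. The function name `running_job_ids` is made up for illustration.

```bash
#!/bin/bash
# Sketch only: assumes LSF-style job listings; running_job_ids is an
# illustrative helper, not a command from the original page.

# Read a job listing on stdin (header line first) and print the ID of
# every job whose state column says RUN, one ID per line.
running_job_ids() {
    awk 'NR > 1 && $3 == "RUN" { print $1 }'
}

# Possible use on the head node (assumption, not run here):
#   bjobs -u all | running_job_ids
# Empty output would mean no job is in RUN state any more.
```

The same filter can serve both purposes in this step: its output is the list to requeue, and an empty result afterwards is the "nothing is running" check.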
+ | |||
+ | * #3 | ||
+ | * We're ready to bring down all compute nodes (but not the head node or the ionode). | ||
+ | * command: '' | ||
+ | * command: '' | ||
+ | * Check with the first command and on the console that the hosts are halted. (We shut the power off next, and any host that is not properly halted creates a lot of work for you upon reboot.) | ||
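
As an extra cross-check on the halt in step #3, a small network probe can list hosts that still answer. This is a sketch under assumptions: `PROBE` and `hosts_still_up` are illustrative names, the default probe assumes Linux `ping` (where `-W` is the reply timeout in seconds), and the host names in the example call are placeholders. A host that no longer answers is not proof of a clean halt, so the console check still applies.

```bash
#!/bin/bash
# Sketch only: PROBE and hosts_still_up are illustrative names, not
# taken from the original page.

# Command used to test one host; it must exit 0 when the host answers.
# The default assumes Linux ping, where -W is a timeout in seconds.
PROBE="${PROBE:-ping -c 1 -W 2}"

# Print every host from the argument list that still answers the probe.
hosts_still_up() {
    local host
    for host in "$@"; do
        # PROBE is left unquoted on purpose so its options are split.
        if $PROBE "$host" >/dev/null 2>&1; then
            echo "$host"
        fi
    done
}

# Example call (host names are placeholders):
#   hosts_still_up compute-1-1 compute-1-2
```

An empty result means none of the listed hosts replied; any name printed is a host to look at before cutting power.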
+ | |||
+ | * #4 | ||
+ | * Power off the compute nodes. | ||
+ | * Edit the script ''/ | ||
+ | |||
+ | < | ||
+ | #!/bin/bash | ||
+ | |||
# All compute nodes on the IPMI subnet (192.168.2.218 through 192.168.2.253).
for i in $(seq 218 253)
do

# CAREFUL, OPTIONS ARE: status, on or off
echo "192.168.2.${i}"
ipmitool -H "192.168.2.${i}" -U XXXXXX -P YYYYYY chassis power off
+ | |||
+ | done | ||
+ | |||
+ | </ | ||
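
Because the script has to be edited by hand to switch between status, on and off, a variant that takes the action from a variable and defaults to a dry run may be less error-prone. The following is a minimal sketch, not the script used on this cluster: `ACTION`, `DRY_RUN`, `ipmi_hosts` and `power_cmd` are names introduced here for illustration, while the address range and the `ipmitool` invocation are copied from the script above.

```bash
#!/bin/bash
# Sketch only: ACTION, DRY_RUN and the helper functions are illustrative.

ACTION="${ACTION:-status}"   # status, on or off; defaults to the harmless one
DRY_RUN="${DRY_RUN:-1}"      # 1 = only print the commands, do not run them

# Print one IPMI address per line for the compute nodes.
ipmi_hosts() {
    local i
    for i in $(seq 218 253); do
        echo "192.168.2.${i}"
    done
}

# Run (or, in a dry run, just print) the chassis power command for one host.
power_cmd() {
    local host="$1" action="$2"
    case "$action" in
        status|on|off) ;;
        *) echo "refusing unknown action: $action" >&2; return 1 ;;
    esac
    if [ "$DRY_RUN" = "1" ]; then
        echo "ipmitool -H $host -U XXXXXX -P YYYYYY chassis power $action"
    else
        ipmitool -H "$host" -U XXXXXX -P YYYYYY chassis power "$action"
    fi
}

for host in $(ipmi_hosts); do
    power_cmd "$host" "$ACTION"
done
```

With the defaults it only prints the status commands for all hosts. Running it with `ACTION=off DRY_RUN=0` would perform the actual power-off, and `ACTION=on DRY_RUN=0` the power-up.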
+ | |||
+ | * #5 | ||
+ | * Issue '' | ||
+ | * Issue '' | ||
+ | * (Arghh, ionode went down with IO errors ... ) | ||
+ | |||
+ | * #6 | ||
+ | * Turn off power to UPS. (not done) | ||
+ | * Turn off power to switches. (not done) | ||
+ | * Turn off power to MD1000 disk arrays. (done) | ||
+ | |||
+ | Technically, | ||
+ | |||
+ | ===== Cluster Power Up ===== | ||
+ | |||
+ | * #1 | ||
+ | * Turn on power to UPS. | ||
+ | * Turn on power to switches. | ||
+ | * Turn on power to MD1000 disk arrays. | ||
+ | |||
+ | * #2 | ||
+ | * Turn on ionode. (check fiber & nfs mounts/ | ||
+ | * Turn on head node. (check scheduler with '' | ||
+ | |||
+ | * #3 | ||
+ | * Turn on compute nodes with '' | ||
+ | * Check that all compute nodes came back up without being re-imaged. | ||
+ | |||
+ | * #4 | ||
+ | * This is a **// | ||
+ | * Reactivate the queues ('' | ||
+ | * Check that all possible jobs are restarted. | ||
+ | |||
+ | * #5 | ||
+ | * Double-check that Tivoli backups are scheduled to run. | ||
+ | * Inform the user base. | ||
+ | * Go home and sleep. | ||