cluster:67 [2008/06/29 19:51] (current)
==== The catastrophic crash of June 08 ====

A huge thank you to all for being patient, understanding,

Here are some notes taken while restoring the cluster:

==== Configuration Changes ====

Previously, our home directories and sanscratch file system areas were 4 TB and 1 TB volumes, respectively,

On the ionode, local file systems are made of these block devices.

This design provides a 1 gigabit per second (gbps) pipe from all compute nodes to the host ionode.

That design has now changed.

The consequence is that each compute node can now read and write directly to the filers over the 1 gbps switch.

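As an illustration of this direct-mount design, an NFS entry in a compute node's /etc/fstab could look like the sketch below. The export path and mount options here are assumptions for illustration, not the cluster's actual settings:

```shell
# Hypothetical fstab entry on a compute node -- the export path (/vol/home)
# and the mount options are illustrative assumptions, not the real config.
# filer4.wesleyan.edu:/vol/home  /home  nfs  rw,hard,intr,tcp  0 0
```

Each node mounting the filer directly means reads and writes go straight over the gigabit switch instead of funneling through a single ionode.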
==== Filer4 ====

Our previous host serving the home directories was filer3.wesleyan.edu ... in order to get rid of the holes, all directories were copied using "

The old directories will be kept for a month or so as a safeguard.

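The exact copy command isn't preserved above. As a hedged sketch only, a home-directory migration of this kind is commonly done with rsync; the paths below are temporary stand-ins, and the flags are assumptions rather than the command actually used:

```shell
# Hypothetical migration sketch -- directories are stand-ins, not real volumes.
OLD=$(mktemp -d)                 # stand-in for the old home volume on filer3
NEW=$(mktemp -d)                 # stand-in for the new volume on filer4
echo "sample" > "$OLD/file.txt"
# -a preserves permissions/ownership/timestamps, -H preserves hard links;
# the trailing slash copies the directory's contents, not the directory itself.
rsync -aH "$OLD"/ "$NEW"/
ls "$NEW"
```

Rewriting every file during the copy also lays the data out fresh on the new volume.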
==== ionode ====

Since all compute nodes are directly connected to the filers, our ionode has no function on the cluster.

So, the options are:
  * the ionode could become a compute node (needs LSF license and more memory)
  * the ionode could provide larger volumes of fast local disk (greater than the 230GB of the heavy weight nodes)
  * the ionode could become a secondary cluster "head" node
  * the ionode could become a secondary LSF "
  * others

==== " ====

The crash of the file system was caused by tremendous growth and deletion of files.

Because we are now using NFS, any file deletions will immediately result in that space flowing back into the main volume.

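A small demonstration of this behavior (the directory is a temporary stand-in, not an actual filer path): once a file is removed, nothing retains its blocks, so the space is free again at once.

```shell
# Illustration only: space used by a file returns to the volume as soon
# as the file is removed.  The directory is a temporary stand-in.
DIR=$(mktemp -d)
dd if=/dev/zero of="$DIR/bigfile" bs=1M count=50 2>/dev/null
df -k "$DIR" | tail -1           # usage includes the 50 MB file
rm "$DIR/bigfile"                # no snapshot retains the blocks
df -k "$DIR" | tail -1           # usage drops immediately
```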
==== Snapshots ====

Snapshots of file changes have been turned off. They will not be re-enabled due to the disk space requirements.

Users should be aware that only the nightly incremental TSM backups will be run.

==== TSM ====

If you read the musings section of notes taken during restoration,

This policy is resulting in a huge number of files on the TSM server: over 18 million.

==== Quotas ====

It might be time to enforce quotas, although I'd prefer not to go there.

My task is to come up with a better way to measure file system growth.

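One hedged sketch of such a measurement (the log name and sampled path are placeholders): take a timestamped usage sample on a schedule, then diff successive lines to see growth per day.

```shell
# Hedged sketch: append a timestamped usage sample per file system to a
# log.  The log name and paths are placeholders; on the cluster this
# would sample /home and /sanscratch nightly from cron.
LOG=${LOG:-fs-growth.log}
for fs in /tmp; do                                   # e.g. /home /sanscratch
    used=$(df -k "$fs" | awk 'NR==2 {print $3}')     # used KB for this fs
    echo "$(date +%Y-%m-%d) $fs ${used}KB" >> "$LOG"
done
```

Comparing any two dated lines for the same file system gives the growth over that interval, which is more informative than a single df snapshot.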
==== /sanscratch ====

Since few people used /

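For reference, the usual scratch-area pattern a job would follow looks like this sketch. The directories here are temporary stand-ins; /sanscratch itself and any per-job layout are assumptions, not site policy:

```shell
# Hedged sketch of a job's scratch workflow: stage in, work, stage out,
# clean up.  Directories are temporary stand-ins for /sanscratch.
SCRATCH=${SCRATCH:-$(mktemp -d)}      # stand-in for /sanscratch
JOBDIR="$SCRATCH/job.$$"              # per-job working directory
mkdir -p "$JOBDIR"
echo "data" > "$JOBDIR/input.dat"                          # stage input in
tr a-z A-Z < "$JOBDIR/input.dat" > "$JOBDIR/result.out"    # placeholder work
cp "$JOBDIR/result.out" .                                  # stage results out
rm -rf "$JOBDIR"                                           # clean up scratch
```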
==== Anything Else? ====