
The catastrophic crash of June 08

A huge thank you to all for being patient, understanding, and supportive during the week of downtime!

Here are some notes taken while restoring the cluster: LINK

Configuration Changes

Previously, our home directories and sanscratch file system areas were 4 TB and 1 TB volumes, respectively, on host “filer3”. They contained 1 TB thin provisioned LUNs which the host filer3 makes accessible as block devices to the host “ionode” on our cluster.

On the ionode, local file systems are made of these block devices. These filesystems are then exported via NFS over the private subnet 10.3.1.x to all compute nodes. The computes mount these home directories as needed when jobs are run via “autofs”.

This design provides a 1 gigabit per second (gbps) pipe from all compute nodes to the host ionode. The link between the ionode and host filer3 is a 4 gbps fiber channel pipe (multipathed, so that if one fiber channel path fails, traffic fails over to a remaining path).

That design has now changed. The hosts filer3 and filer4 (which are clustered network attached storage devices) now each have a 1 gbps ethernet pipe directly to the Cisco ethernet switch. This means these filers are on the 10.3.1.x private network. We could do this because the filers had unused ports.

The consequence is that each compute node can now read/write directly to the filers over the 1 gbps switch. We lose the 4 gbps fiber channel link, but since the compute nodes were previously linked to the ionode over the same switch, I'm not anticipating losing performance. We may gain performance, as each node can go directly to the filers without creating a bottleneck at the host ionode. We'll see.
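The reasoning above can be sketched as a small calculation. This is only a back-of-the-envelope model, assuming the ceiling of a path is set by its slowest link; it ignores protocol overhead, disk speed, and contention between nodes. The function name is made up for illustration.

```python
def path_ceiling_gbps(*link_speeds_gbps):
    """Upper bound on end-to-end throughput: the slowest link on the path."""
    return min(link_speeds_gbps)


# Old design: compute node -> switch -> ionode (1 gbps ethernet),
# then ionode -> filer3 (4 gbps fiber channel).
old_ceiling = path_ceiling_gbps(1.0, 4.0)

# New design: compute node -> switch -> filer (1 gbps ethernet).
new_ceiling = path_ceiling_gbps(1.0)

print(f"old path ceiling: {old_ceiling} gbps")
print(f"new path ceiling: {new_ceiling} gbps")
```

Under this simple model both designs top out at the 1 gbps ethernet link, which is why dropping the 4 gbps fiber channel hop is not expected to cost anything.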

Filer4

Our previous host serving the home directories was filer3.wesleyan.edu … in order to get rid of the holes, all directories were copied using “rsync” to a new host filer4.wesleyan.edu. These hosts fail over to each other if a failure occurs, hence two ethernet cables had to be strung. The hosts are identical to each other.

The old directories will be kept for a month or so as a safeguard. A notice will be sent out when the old directories are deleted.

ionode

Since all compute nodes are directly connected to the filers, our ionode has no function on the cluster. I will leave the LUN configuration in place. In the future it may be that a project requires large volumes of fast local disks, in which case we could provide that via the 4 gbps fiber channel while running those jobs on the ionode locally.

So, options

  • the ionode could become a compute node (needs LSF license and more memory)
  • the ionode could provide larger volumes of fast local disk (greater than the 230 GB of the heavy weight nodes)
  • the ionode could become a secondary cluster “head” node
  • the ionode could become a secondary LSF “master” scheduler
  • others

"Holes"

The crash of the file system was caused by tremendous growth and deletion of files. This resulted in large volumes of unused space … sort of a very fragmented file system.

Because we are now using NFS, any file deletions will immediately result in that space flowing back into the main volume. It becomes available immediately because now the filers are involved in the delete actions.
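A toy model may help show the difference. It is a deliberately simplified sketch, not a description of how the filers actually account for blocks: it assumes the filer never gets space back from a LUN once it has been written to, while on an NFS volume a delete frees space right away. The class names and the sizes are made up.

```python
class ThinLun:
    """Toy view of a thin-provisioned LUN as seen from the filer.

    The file system lives on the ionode, so the filer only sees block
    writes. Assumption: freed blocks may be reused by the ionode file
    system, but are never handed back to the filer.
    """

    def __init__(self):
        self.live_gb = 0       # data the file system still references
        self.allocated_gb = 0  # space the filer has given to the LUN

    def write(self, gb):
        self.live_gb += gb
        # Allocation only ever grows (high-water mark).
        self.allocated_gb = max(self.allocated_gb, self.live_gb)

    def delete(self, gb):
        self.live_gb -= gb     # allocated_gb stays put: this is the "hole"

    @property
    def hole_gb(self):
        return self.allocated_gb - self.live_gb


class NfsVolume:
    """Toy view of a volume the filer serves over NFS itself."""

    def __init__(self):
        self.used_gb = 0

    def write(self, gb):
        self.used_gb += gb

    def delete(self, gb):
        self.used_gb -= gb     # the filer performs the delete, space returns


# Made-up workload: a project writes 500 GB, then deletes 400 GB of it.
lun, nfs = ThinLun(), NfsVolume()
for store in (lun, nfs):
    store.write(500)
    store.delete(400)

print(f"LUN: {lun.allocated_gb} GB allocated, {lun.hole_gb} GB of holes")
print(f"NFS: {nfs.used_gb} GB used")
```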

Snapshots

Snapshots of file changes have been turned off. They will not be re-enabled due to the disk space requirements.

Users should be aware that only the nightly incremental TSM backups will be run.

TSM

If you read the musings section of the notes taken during restoration, you are aware of the following. Our backup policy is to keep one inactive version of each modified file for 30 days. We also keep deleted files for 30 days before purging.

This policy is resulting in a huge number of files on the TSM server: over 18 million. We should probably shorten one of the two retention periods, but I'm unsure which one.
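One way to decide would be to count how many stored objects each of the two rules is holding on to. The sketch below only models the policy as described above; it does not talk to TSM, and the record type, function name, and sample inventory are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StoredObject:
    kind: str      # "inactive" (old version of a modified file) or "deleted"
    age_days: int  # days since the version went inactive / file was deleted


def retained_counts(objects, inactive_days=30, deleted_days=30):
    """Count objects still kept under each retention rule."""
    limits = {"inactive": inactive_days, "deleted": deleted_days}
    counts = {"inactive": 0, "deleted": 0}
    for obj in objects:
        if obj.kind not in limits:
            raise ValueError(f"unknown kind: {obj.kind!r}")
        if obj.age_days < limits[obj.kind]:
            counts[obj.kind] += 1
    return counts


# Made-up inventory, purely for illustration.
inventory = [
    StoredObject("inactive", 5),
    StoredObject("inactive", 20),
    StoredObject("inactive", 45),
    StoredObject("deleted", 10),
    StoredObject("deleted", 29),
]

print(retained_counts(inventory))                    # current 30/30 policy
print(retained_counts(inventory, inactive_days=14))  # shorter inactive rule
```

Run against a real inventory, comparing the two counts would show which retention period is responsible for most of the stored files.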

Quotas

It might be time to enforce quotas, although I'd prefer not to go there. I would like to stress that users who embark on projects that suddenly load large volumes of data should communicate this to me in advance. My monitors had zero time to warn me of the sudden spike in file system growth.

My task is to come up with a better way to measure file system growth. I'll probably try to use our Nagios/Cacti monitoring and alerting tools for that.
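As a starting point, here is a minimal sketch of a growth-rate check. It assumes two usage samples taken some time apart and returns the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL). The function names and the thresholds are placeholders, not a finished plugin.

```python
import shutil
import time

OK, WARNING, CRITICAL = 0, 1, 2  # Nagios plugin exit codes


def take_sample(path):
    """Return (unix_timestamp, used_bytes) for the file system holding path."""
    return (time.time(), shutil.disk_usage(path).used)


def growth_gb_per_hour(sample_old, sample_new):
    """Growth rate between two (timestamp, used_bytes) samples."""
    (t0, used0), (t1, used1) = sample_old, sample_new
    hours = (t1 - t0) / 3600.0
    if hours <= 0:
        raise ValueError("samples must be in chronological order")
    return (used1 - used0) / 1e9 / hours


def classify(rate, warn_gb_per_hour, crit_gb_per_hour):
    """Map a growth rate to a Nagios status code."""
    if rate >= crit_gb_per_hour:
        return CRITICAL
    if rate >= warn_gb_per_hour:
        return WARNING
    return OK


# Made-up samples: usage goes from 100 GB to 150 GB in 30 minutes.
earlier = (0, 100e9)
later = (1800, 150e9)
rate = growth_gb_per_hour(earlier, later)

# Hypothetical thresholds: warn at 20 GB/hour, critical at 50 GB/hour.
status = classify(rate, warn_gb_per_hour=20, crit_gb_per_hour=50)
print(f"growth: {rate:.1f} GB/hour, status code {status}")
```

In practice the previous sample would be stored between runs (for example in a small state file) and `take_sample` called on the home directory mount point each time the check fires.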

/sanscratch

Since few people used /sanscratch, I have folded it into the same volume as the home directories. That achieves two things: one) folks who use it still can, as it is still shared amongst all the compute nodes but is not backed up, and two) it rolls the 1 TB locked away by the LUN into the home directory area.

Anything Else?

cluster/67.txt · Last modified: 2008/06/29 19:51 (external edit)