The catastrophic crash of June 2008.
The actual cause of the crash was that the 4 TB home directory file system filled up. This happened on:

Sun Jun 22 16:23:54 EDT [filer3: wafl.vol.full:notice]: file system on volume cluster_home is full
My notes on the recovery are below.
Here are some facts:
Mounting the cluster_home volume on swallowtail reveals …
Col A: the size of the physical LUN files (ll -h against the "lun files" inside the volume)
Col B: the space used according to the Linux host (du -hs, the local disk usage summary)
Col C: the space used according to the host filer3 (df -h against the mount point)
Col D: the size of the "holes", that is, B - C
Col E: the time it took to rsync one way
Col F: the NMAP number of the LUN
| NMAP | rsync | B-C (G) | df -h | du -hs | ll -h | LUN | Note |
|------|-------|---------|-------|--------|-------|-----|------|
| #7 | a-i: 1 hr, j-y: 7 hrs, z: 6 hrs | 126 | 217G | 343G | 1.1T | rusers3 | adavis02 adezieck aminei bkormos dfrohman eaaron ebarnes gng gpetersson imukerji jbodyfelt jfarnham jknee lost+found mlee03 qgu spieniazek vclapa wpringle yminami ztan |
| #4 | 8 hrs | 442 | 324G | 766G | 1.1T | rusers5 | ajbenson alarner lvargaslara sdixit shorowitz skong |
| #2 | - | 18 | 464G | 482G | 1.1T | rusers7 | abhattachary amoreno ewheatley fstarr gconnors jlocey mspescha vscavera wdai |
| #1 | - | 64 | 299G | 363G | 1.1T | users | (all other users) 36 hrs for rsync to build the file inventory; abandoned for now |
The test LUN was NMAP #8.
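The per-LUN numbers above came from a handful of standard commands. A minimal sketch, where survey() is a hypothetical helper and the mount point /home/rusers3 is illustrative (the real work was plain du/df/ls runs):

```shell
# Gather Col B and Col C for one LUN mount point.
survey() {
  mnt=$1
  # Col B: space used per the Linux host (du summary)
  printf 'B (du): %s\n' "$(du -sh "$mnt" 2>/dev/null | awk '{print $1}')"
  # Col C: space used per the mount point ("Used" field of df -h)
  printf 'C (df): %s\n' "$(df -h "$mnt" | awk 'NR==2 {print $3}')"
}
# Col A came from listing the LUN backing files on the filer side ("ll -h"),
# e.g. ls -lh against the lun files inside the cluster_home volume.
# Usage on the head node would be, e.g.: survey /home/rusers3
```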
LUNs reported as 104 MB in size by the Linux host, or as 17 GB by the host filer3, are empty or almost empty; those figures are just the overhead of an empty LUN.
The Linux host reports 2.8 TB used. The filer reports slightly over 4 TB used (hence we filled up the volume holding the LUNs). That leaves 1.2 TB of deleted-file space, the "holes", that needs to be reclaimed.
This will help with the restore order of the LUNs.
See columns D, C and B in the table above. The filer reports the home directory volume is full. The volume is 4 TB, space reserved, so yes, it is full at 4,045 G used. The Linux head node reports 2.8 TB of data in the home dirs. That implies the filer holds 1.2 TB of "holes", areas of the file system not yet reclaimed; in effect, it needs defragmentation. The rate of file deletion must be tremendous.
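The holes arithmetic, both per LUN and for the whole volume, can be checked mechanically (numbers copied from the table and text above):

```shell
# Col D = B - C per LUN (values in GB from the du -hs and df -h columns),
# plus the volume-level check: filer usage minus head-node usage.
awk 'BEGIN {
  print "rusers3", 343 - 217                    # 126 G of holes
  print "rusers5", 766 - 324                    # 442 G
  print "rusers7", 482 - 464                    # 18 G
  print "users",   363 - 299                    # 64 G
  printf "volume holes: %.1f TB\n", 4.0 - 2.8   # filer vs Linux head node
}'
```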
Yes, it appears we have a lot of files. The TSM total file count of course includes both the one active and the one inactive version (if modified) of each file, or only the deleted version. Deleted files and inactive versions are kept for 30 days, so those are included in the 18 million count.
The Linux head node reports that 2.8 TB of data is on disk. At roughly 2:1 compression, that would be about 1.4 TB of compressed TSM data. Wow. So 2.5 TB minus 1.4 TB leaves 1.1 TB of compressed volume for the inactive and deleted versions of files, the equivalent of 2.2 TB of data if on disk! That almost matches, again pointing at a tremendous modification and/or deletion rate of files.
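The back-of-the-envelope above works out as follows; the roughly 2:1 compression ratio is an assumption read off the 2.8 TB to 1.4 TB step:

```shell
# TSM capacity sanity check (all figures in TB, 2:1 compression assumed).
awk 'BEGIN {
  on_disk   = 2.8                   # data on disk per the Linux head node
  tsm_total = 2.5                   # compressed data held by TSM
  active    = on_disk / 2           # compressed size of the active versions
  leftover  = tsm_total - active    # compressed inactive/deleted versions
  printf "active=%.1f leftover=%.1f on-disk-equivalent=%.1f (TB)\n", active, leftover, leftover * 2
}'
```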