\\
**[[cluster:0|Back]]**
The catastrophic crash of June 08.
The actual cause of the crash was the 4 TB home directory file system filling up. This happened on
//Sun Jun 22 16:23:54 EDT [filer3: wafl.vol.full:notice]: file system on volume cluster_home is full//
My notes on the recovery are below.
Here are some facts:
* From mid-March to the end of June (~3 months), the TSM file count grew from 4,724,563 to 18,020,855.
* Over the same period, the TSM volume grew from 1.6 to 2.5 TB (compressed, so multiply by 2 for the uncompressed volume).
* Over the same period, the filer's disk space usage (out of 4 TB) grew from 1.7 to 2.9 TB.
* On one LUN, rsync took 36 hrs to build the list of files present before copying even started!
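
To put the growth figures side by side, here is a minimal sketch that computes the absolute increase and the growth factor for each of the three measurements above. The numbers are copied from the list; the names ''measurements'' and ''growth'' are made up for illustration.

```python
# Figures copied from the facts above (mid-March to end of June).
measurements = {
    "TSM file count": (4_724_563, 18_020_855),
    "TSM volume (TB, compressed)": (1.6, 2.5),
    "Filer disk usage (TB)": (1.7, 2.9),
}


def growth(start, end):
    """Return (absolute increase, growth factor) between two measurements."""
    return end - start, end / start


results = {name: growth(start, end) for name, (start, end) in measurements.items()}

for name, (increase, factor) in results.items():
    print(f"{name}: +{increase:,.1f} ({factor:.1f}x)")
```

The file count grew by roughly 3.8x while the filer's disk usage grew by roughly 1.7x, which hints at a large number of small files.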
===== LUN Sizes =====
Mounting the cluster_home volume on swallowtail reveals ...
Col A: The size of the physical LUNs (ll -h against the "lun files" inside the volume)\\
Col B: The size used according to the host filer3 (du -hs, disk usage summary of the "lun files")\\
Col C: The size used according to the linux host (df -h against the mount point)\\
Col D: The size of the "holes", that is B-C\\
Col E: The time it took to rsync one way\\
Col F: NMAP
^ F ^ E ^ D ^ C ^ B ^ A ^ ^^
^ ^ rsync ^ B-C ^ df -h ^ du -hs ^ ll -h ^ LUN ^ Note ^
| na | - | - | - | - | 1T | sanscratch | (re-empty) |
| #10 | - | - | 104M | 17G | 1.1T | cusers | (empty) |
| #9 | | 148 | 663G | 811G | 1.1T | rusers | chsu sknauert |
| #12 | | 253 | 771G | 1.1T | 1.1T | rusers2 | dbblum |
| #7 | a-i:1hr j-y:7hrs z:6hrs | 126 | 217G | 343G | 1.1T | rusers3 | adavis02 adezieck aminei bkormos dfrohman eaaron ebarnes gng gpetersson imukerji jbodyfelt jfarnham jknee lost+found mlee03 qgu spieniazek vclapa wpringle yminami ztan |
| #3 | - | - | 104M | 17G | 1.1T | rusers4 | (empty) |
| #4 | 8 hrs | 442 | 324G | 766G | 1.1T | rusers5 | ajbenson alarner lvargaslara sdixit shorowitz skong |
| #6 | - | - | 129M | 17G | 1.1T | rusers6 | chemdata |
| #2 | - | 18 | 464G | 482G | 1.1T | rusers7 | abhattachary amoreno ewheatley fstarr gconnors jlocey mspescha vscavera wdai |
| #5 | 5 hrs | 179 | 77G | 256G | 1.1T | rusers8 | bstewart |
| #1 | - | 64 | 299G | 363G | 1.1T | users | (all other users) 36 hrs for rsync to build file inventory, abandoned for now |
^ ^ ^ 1,230G ^ 2,815G ^ 4,045G ^ ^^^
The test LUN was NMAP #8.
LUNs showing 104 MB as reported by the linux host, or 17 GB as reported by host filer3, are empty or almost empty; those numbers are pure overhead.
The linux host reports 2.8 TB in use. The filer reports slightly over 4 TB in use (hence we filled up the volume holding the LUNs). The 1.2 TB difference is space still held by deleted files, the "holes", and it needs to be reclaimed.
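
The "holes" arithmetic can be reproduced per LUN. The following is a rough sketch using the B and C values from the table, read the same way as the paragraph above (C is what the linux host sees, B is what the filer has consumed). The empty LUNs are left out, as they appear to be in the totals row. One assumption: rusers2 is printed as 1.1T in column B, so 1024G is inferred here from C + D (771 + 253).

```python
# (LUN, C = GB used per the linux host, B = GB consumed on the filer)
luns = [
    ("rusers", 663, 811),
    ("rusers2", 771, 1024),  # assumption: table prints 1.1T; 1024G inferred from C + D
    ("rusers3", 217, 343),
    ("rusers5", 324, 766),
    ("rusers7", 464, 482),
    ("rusers8", 77, 256),
    ("users", 299, 363),
]

# Column D: space consumed on the filer that the host no longer uses.
holes = {name: b - c for name, c, b in luns}

total_c = sum(c for _, c, _ in luns)
total_b = sum(b for _, _, b in luns)
total_holes = sum(holes.values())

# List the LUNs with the most unreclaimed space first.
for name, gb in sorted(holes.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:8s} {gb:5d}G")
print(f"total    {total_holes:5d}G  (B={total_b}G, C={total_c}G)")
```

The totals come out to 1,230G, 2,815G and 4,045G, matching the last row of the table.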
===== Order of Restores =====
\\
Delete
\\
* sanscratch ... lun deleted, waiting for reclamation, offlined lun volume
* sanscratch ... create nfs dir on filer4, nfs reconfig, mount
\\
Few "holes"
\\
* cusers ... rsynced over, nfs reconfig, mount
* rusers4 ... rsynced over, nfs reconfig, mount
* rusers6 ... rsynced over, nfs reconfig, mount
* rusers7 ... rsyncing over all, nfs reconfig, mount
\\
\\
Many "holes"
\\
- users ... rsyncing over hpc05, nfs reconfig, mount
- rusers5 ... rsynced over, umount, nfs reconfig, mount
- rusers3 ... rsynced over, umount, nfs reconfig, mount
- rusers8 ... rsynced over, umount, nfs reconfig, mount
- rusers ... rsyncing over chsu, nfs reconfig, mount
- rusers2 ... rsynced over, nfs reconfig, mount
===== Large Home Dirs =====
This will help with the restore order of the LUNs.
^ Size ^ Home dir ^
| 235G |./users/hpc05|
| 663G |./rusers/chsu|
| 771G |./rusers2/dbblum|
| 87G |./rusers3/jbodyfelt|
| 107G |./rusers3/ztan|
| 72G |./rusers5/ajbenson|
| 134G |./rusers5/sdixit|
| 87G |./rusers5/skong|
| 455G |./rusers7/wdai|
| 76G |./rusers8/bstewart|
===== Musings =====
\\
* 1,230G & 2,815G & 4,045G
\\
Columns D, C and B from the table above. The filer reports that the home directory volume is full. The volume has 4 TB of space reserved, so yes, it is full at 4,045G used. The linux head node reports 2.8 TB of data in the home dirs. That implies the filer has 1.2 TB of "holes", areas of the file system freed by deletions but not yet reclaimed. Defragmentation. The rate of file deletion must be tremendous.
\\
* tsm total number of files for the cluster grew from 4,724,563 to 18,020,855 in the last 3 months.
\\
Yes, it appears we have a lot of files. The TSM total file count of course includes, for each file, the one active version plus the one inactive version (if the file was modified), or only the deleted version. Deleted files and inactive versions are kept for 30 days, so those are included in the 18 million count.
\\
* tsm total volume for the cluster grew from 1.6 to 2.5 TB
\\
The linux head node reports that 2.8 TB of data is on disk. That would be roughly 1.4 TB of compressed TSM data. Wow. So 2.5 minus 1.4 equals 1.1 TB of compressed volume for the inactive and deleted versions of files. That would be the equivalent of 2.2 TB of data if on disk! Almost matches, pointing at a tremendous modification and/or deletion rate of files.
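
The back-of-the-envelope estimate above, written out as a small sketch. The 2x compression ratio is the rule of thumb used in these notes, not a measured value, and the variable names are made up for illustration.

```python
COMPRESSION = 2.0    # assumption: the 2x rule of thumb from the notes above

on_disk_tb = 2.8     # data the linux head node reports on disk
tsm_total_tb = 2.5   # total compressed TSM volume for the cluster

# Compressed size of the files that are currently on disk (active versions).
active_compressed_tb = on_disk_tb / COMPRESSION

# What remains in TSM must be inactive and deleted versions.
inactive_compressed_tb = tsm_total_tb - active_compressed_tb

# Equivalent uncompressed size of those inactive and deleted versions.
inactive_on_disk_tb = inactive_compressed_tb * COMPRESSION

print(f"active, compressed:            {active_compressed_tb:.1f} TB")
print(f"inactive/deleted, compressed:  {inactive_compressed_tb:.1f} TB")
print(f"inactive/deleted, uncompressed: {inactive_on_disk_tb:.1f} TB")
```

If the real compression ratio differs from 2x, the split between active and inactive data shifts accordingly, so the result is only as good as that assumption.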
\\
**[[cluster:0|Back]]**