The catastrophic crash of June 08.

The actual cause of the crash was the 4 TB home directory file system filling up. This happened on:

Sun Jun 22 16:23:54 EDT [filer3: wafl.vol.full:notice]: file system on volume cluster_home is full

My notes on the recovery are below.

Here are some facts

  • from mid-March to the end of June (~3 months) the TSM file count grew from 4,724,563 to 18,020,855
  • from mid-March to the end of June (~3 months) the TSM volume grew from 1.6 to 2.5 TB (compressed, so multiply by 2x for the uncompressed volume)
  • from mid-March to the end of June (~3 months) the filer's disk space usage (of 4 TB) grew from 1.7 to 2.9 TB
  • on one LUN, rsync took 36 hrs (!) to build the list of files present before copying started
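To put the growth figures above side by side, here is a minimal sketch that turns them into growth factors. The variable names and the helper `growth_factor` are made up for illustration; only the numbers come from the notes.

```python
# Figures quoted in the facts above (mid-March to end of June, ~3 months).
tsm_files_start, tsm_files_end = 4_724_563, 18_020_855
tsm_tb_start, tsm_tb_end = 1.6, 2.5        # compressed TSM volume, in TB
filer_tb_start, filer_tb_end = 1.7, 2.9    # filer disk usage, in TB


def growth_factor(start, end):
    """Return how many times larger the end value is than the start value."""
    return end / start


file_growth = growth_factor(tsm_files_start, tsm_files_end)
tsm_growth = growth_factor(tsm_tb_start, tsm_tb_end)
filer_growth = growth_factor(filer_tb_start, filer_tb_end)

print(f"TSM file count grew x{file_growth:.2f}")
print(f"TSM volume grew     x{tsm_growth:.2f}")
print(f"filer usage grew    x{filer_growth:.2f}")
```

The file count grows much faster than either volume figure, which would be consistent with a large number of small files being created.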

LUN Sizes

Mounting the cluster_home volume on swallowtail reveals …

Col A: The size of the physical LUNs (ll -h against the “lun files” inside the volume)
Col B: The size used according to the host filer3 (du -hs, disk usage summary of the “lun files”)
Col C: The size used according to the linux host (df -h against the mount point)
Col D: The size of the “holes”, that is B-C
Col E: The time it took to rsync one way
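The gap between a LUN file's nominal size and the space it actually occupies is the difference between a file's apparent size and its allocated blocks, which is what a long listing and a disk usage summary respectively report. A minimal sketch of that comparison on a Unix host follows. The helper name `apparent_and_allocated` and the temporary demo file are illustrative, and whether the filer's export reports allocated blocks the same way is an assumption.

```python
import os
import tempfile


def apparent_and_allocated(path):
    """Return (apparent size, allocated size) of a file, both in bytes.

    The apparent size is what a long listing shows; the allocated size is
    derived from st_blocks, which POSIX counts in 512-byte units.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


# Demo on a throwaway file: extend it to 1 MiB without writing any data.
# On file systems that support sparse files, few or no blocks get allocated.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.truncate(1024 * 1024)
    demo_path = tmp.name

apparent, allocated = apparent_and_allocated(demo_path)
os.remove(demo_path)

print(f"apparent:  {apparent} bytes")
print(f"allocated: {allocated} bytes")
```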

^ # ^ rsync (E) ^ B-C (D) ^ df -h (C) ^ du -hs (B) ^ ll -h (A) ^ LUN ^ Note ^
| na | - | - | - | - | 1T | sanscratch | (re-empty) |
| #10 | - | - | 104M | 17G | 1.1T | cusers | (empty) |
| #9 | | 148 | 663G | 811G | 1.1T | rusers | chsu sknauert |
| #12 | | 253 | 771G | 1.1T | 1.1T | rusers2 | dbblum |
| #7 | a-i:1hr j-y:7hrs z:6hrs | 126 | 217G | 343G | 1.1T | rusers3 | adavis02 adezieck aminei bkormos dfrohman eaaron ebarnes gng gpetersson imukerji jbodyfelt jfarnham jknee lost+found mlee03 qgu spieniazek vclapa wpringle yminami ztan |
| #3 | - | - | 104M | 17G | 1.1T | rusers4 | (empty) |
| #4 | 8 hrs | 442 | 324G | 766G | 1.1T | rusers5 | ajbenson alarner lvargaslara sdixit shorowitz skong |
| #6 | - | - | 129M | 17G | 1.1T | rusers6 | chemdata |
| #2 | - | 18 | 464G | 482G | 1.1T | rusers7 | abhattachary amoreno ewheatley fstarr gconnors jlocey mspescha vscavera wdai |
| #5 | 5 hrs | 179 | 77G | 256G | 1.1T | rusers8 | bstewart |
| #1 | - | 64 | 299G | 363G | 1.1T | users | (all other users) 36 hrs for rsync to build file inventory, abandoned for now |
| Total | | 1,230G | 2,815G | 4,045G | | | |

The test LUN was NMAP #8.

The sizes of 104 MB as reported by the linux host, or 17 GB as reported by host filer3, are the overhead of empty or almost empty LUNs.

The linux host reports 2.8 TB in use. The filer reports slightly over 4 TB in use (hence we filled up the volume holding the LUNs). The difference of 1.2 TB is space left behind by deleted files, the “holes”, and needs to be reclaimed.
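The per-LUN “holes” and the totals can be re-derived from the table figures with a short sketch. The dictionary layout and names are illustrative. Treating the 1.1T entry for rusers2 as 1024G is an assumption, chosen because it reproduces the table's totals. The empty LUNs are left out, as they are in the totals.

```python
# (filer_used_G, linux_used_G) per LUN, taken from the table above.
# Assumption: the "1.1T" filer figure for rusers2 is treated as 1024G.
luns = {
    "rusers":  (811, 663),
    "rusers2": (1024, 771),
    "rusers3": (343, 217),
    "rusers5": (766, 324),
    "rusers7": (482, 464),
    "rusers8": (256, 77),
    "users":   (363, 299),
}

# A "hole" is space the filer still counts as used but the linux host does not.
holes = {name: filer - linux for name, (filer, linux) in luns.items()}

total_filer = sum(filer for filer, _ in luns.values())
total_linux = sum(linux for _, linux in luns.values())
total_holes = sum(holes.values())

# Print LUNs with the largest holes first.
for name, size in sorted(holes.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:8s} {size:5d}G of holes")
print(f"filer {total_filer}G, linux {total_linux}G, holes {total_holes}G")
```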

Order of Restores


  • <hi #ffff00>sanscratch … lun deleted, waiting for reclamation, offlined lun volume</hi>
  • <hi #ffff00>sanscratch …create nfs dir on filer4, nfs reconfig mount</hi>

Few “holes”

  • <hi #ffff00>cusers … rsynced over, nfs reconfig, mount</hi>
  • <hi #ffff00>rusers4 … rsynced over, nfs reconfig, mount </hi>
  • <hi #ffff00>rusers6 … rsynced over, nfs reconfig, mount </hi>
  • <hi #ffff00>rusers7 rsyncing over all</hi> nfs reconfig, mount

Many “holes”

  1. <hi #ffff00>users … rsyncing over hpc05,</hi> nfs reconfig, mount
  2. <hi #ffff00>rusers5 … rsynced over, umount, nfs reconfig, mount</hi>
  3. <hi #ffff00>rusers3 … rsynced over, umount, nfs reconfig, mount</hi>
  4. <hi #ffff00>rusers8 … rsynced over, umount, nfs reconfig, mount</hi>
  5. <hi #fa8072>rusers … rsyncing over chsu,</hi> nfs reconfig, mount
  6. <hi #ffff00>rusers2 … rsynced over, nfs reconfig, mount</hi>

Large Home Dirs

This will help with the restore order of the LUNs.

Size Username
235G ./users/hpc05
663G ./rusers/chsu
771G ./rusers2/dbblum
87G ./rusers3/jbodyfelt
107G ./rusers3/ztan
72G ./rusers5/ajbenson
134G ./rusers5/sdixit
87G ./rusers5/skong
455G ./rusers7/wdai
76G ./rusers8/bstewart


  • 1,230G & 2,815G & 4,045G

Columns D, C and B from the table above. The filer reports the home directory volume is full. The volume has 4 TB of space reserved, so yes, it is full at 4,045G used. The linux head node reports 2.8 TB of data in the home dirs. That implies the filer has 1.2 TB of “holes”, areas of the file system not reclaimed. Defragmentation. The rate of file deletion must be tremendous.

  • tsm total number of files for the cluster grew from 4,724,563 to 18,020,855 in the last 3 months.

Yes, it appears we have a lot of files. The TSM total file count of course includes the one active and the one inactive version (if modified) of each file, or the deleted version only. Deleted files and inactive versions are kept for 30 days, so those are included in the 18 million count.

  • tsm total volume for the cluster grew from 1.6 to 2.5 TB

The linux head node reports that 2.8 TB of data is on disk. That would be roughly 1.4 TB of compressed TSM data. Wow. So 2.5 minus 1.4 equals 1.1 TB of compressed volume for the inactive and deleted versions of files. That would be the equivalent of 2.2 TB of data if on disk! Almost matches, pointing at a tremendous modification and/or deletion rate of files.
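The estimate above can be written out step by step. This is only a sketch of the arithmetic in the paragraph: the 2x compression ratio is the rough figure used in these notes, not a measured value, and the variable names are illustrative.

```python
COMPRESSION_RATIO = 2.0   # rough 2x figure used in these notes (assumption)

on_disk_tb = 2.8          # data the linux head node reports on disk
tsm_total_tb = 2.5        # total compressed TSM volume for the cluster

# Compressed size of the active versions, i.e. what is on disk right now.
active_compressed_tb = on_disk_tb / COMPRESSION_RATIO

# The remainder must be inactive and deleted versions (kept for 30 days).
inactive_compressed_tb = tsm_total_tb - active_compressed_tb

# Equivalent size if those versions were sitting uncompressed on disk.
inactive_on_disk_tb = inactive_compressed_tb * COMPRESSION_RATIO

print(f"active, compressed:     {active_compressed_tb:.1f} TB")
print(f"inactive, compressed:   {inactive_compressed_tb:.1f} TB")
print(f"inactive, uncompressed: {inactive_on_disk_tb:.1f} TB")
```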

