This shows you the differences between two versions of the page.
— |
cluster:66 [2008/06/28 15:59] (current) |
||
---|---|---|---|
Line 1: | Line 1: | ||
+ | \\ | ||
+ | **[[cluster: | ||
+ | The catastrophic crash of June 2008.
+ | |||
+ | The actual cause of the crash was that the 4 TB home directory file system filled up.
+ | |||
+ | //Sun Jun 22 16:23:54 EDT [filer3: wafl.vol.full: | ||
+ | |||
+ | |||
+ | My notes on the recovery are below. | ||
+ | |||
+ | Here are some facts:
+ | |||
+ | * from mid-March to the end of June (~3 months) TSM files grew from 4,724,563 to 18,
+ | * from mid-March to the end of June (~3 months) TSM volume grew from 1.6 to 2.5 TB (compressed, so multiply by 2 for the uncompressed volume)
+ | * from mid-March to the end of June (~3 months) the filer'
+ | * on one LUN, rsync took 36 hrs (!) to build the list of files before copying started
+ | |||
+ | ===== LUN Sizes ===== | ||
+ | |||
+ | Mounting the cluster_home volume on swallowtail reveals ... | ||
+ | |||
+ | Col A: The size of the physical LUNs (command against "lun files" inside volume)\\ | ||
+ | Col B: The size used according to the host filer3 (du -hs, disk usage summary)\\
+ | Col C: The size used according to the linux host (df -h against mount point)\\
+ | Col D: The size of the " | ||
+ | Col E: The time it took to rsync one way\\ | ||
+ | Col F: NMAP | ||
+ | |||
+ | ^ F ^ E ^ D ^ C ^ B ^ A ^ ^^ | ||
+ | ^ ^ rsync ^ B-C ^ df -h ^ du -hs ^ ll -h ^ LUN ^ Note ^ | ||
+ | | na | - | - | - | - | 1T | sanscratch | (re-empty) | | ||
+ | | #10 | - | - | 104M | 17G | 1.1T | cusers | (empty) | | ||
+ | | #9 | | 148 | 663G | 811G | 1.1T | rusers | chsu sknauert | | ||
+ | | #12 | | 253 | 771G | 1.1T | 1.1T | rusers2 | dbblum | ||
+ | | #7 | a-i:1hr j-y:7hrs z: | ||
+ | | #3 | - | - | 104M | 17G | 1.1T | rusers4 | (empty) | | ||
+ | | #4 | 8 hrs | 442 | 324G | 766G | 1.1T | rusers5 | ajbenson | ||
+ | | #6 | - | - | 129M | 17G | 1.1T | rusers6 | chemdata | | ||
+ | | #2 | - | 18 | 464G | 482G | 1.1T | rusers7 | abhattachary amoreno ewheatley fstarr gconnors jlocey mspescha
+ | | #5 | 5 hrs | 179 | 77G | 256G | 1.1T | rusers8 | bstewart | | ||
+ | | #1 | | ||
+ | ^ ^ ^ 1, | ||
+ | |||
+ | The test LUN was NMAP #8.
+ | |||
+ | The 104 MB reported by the linux host, or the 17 GB reported by host filer3, is the overhead of an empty or almost empty LUN.
+ | |||
+ | The linux host reports 2.8 TB in use. The filer reports slightly over 4 TB in use (hence we filled up the volume holding the LUNs).
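The gap between the two views can be recomputed per LUN. The following is only a minimal sketch: it uses the four rows of the table above whose B, C and D cells are all complete, and the helper name `gap_gb` is made up for illustration.

```python
# Used space per LUN in GB, copied from the complete rows of the table above:
# (LUN name, column C "df -h", column B "du -hs", column D as listed)
rows = [
    ("rusers", 663, 811, 148),
    ("rusers5", 324, 766, 442),
    ("rusers7", 464, 482, 18),
    ("rusers8", 77, 256, 179),
]


def gap_gb(col_c_gb, col_b_gb):
    """Column D: the difference B - C, in GB (hypothetical helper)."""
    return col_b_gb - col_c_gb


for name, c, b, listed_d in rows:
    d = gap_gb(c, b)
    share = 100.0 * d / b  # how much of column B is not visible in column C
    print(f"{name:8s} C={c:4d}G  B={b:4d}G  D={d:4d}G ({share:.0f}% of B), listed D={listed_d}G")
```

For these four rows the recomputed difference agrees with the listed column D; the remaining rows are left out because their cells are empty or cut off.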
+ | |||
+ | ===== Order of Restores ===== | ||
+ | |||
+ | \\ | ||
+ | Delete | ||
+ | \\ | ||
+ | * <hi # | ||
+ | * <hi # | ||
+ | \\ | ||
+ | Few " | ||
+ | \\ | ||
+ | * <hi # | ||
+ | * <hi # | ||
+ | * <hi # | ||
+ | * <hi # | ||
+ | \\ | ||
+ | \\ | ||
+ | Many " | ||
+ | \\ | ||
+ | - <hi # | ||
+ | - <hi # | ||
+ | - <hi # | ||
+ | - <hi # | ||
+ | - <hi # | ||
+ | - <hi # | ||
+ | |||
+ | |||
+ | ===== Large Home Dirs ===== | ||
+ | |||
+ | This will help with the restore order of the LUNs. | ||
+ | |||
+ | ^ Size ^ Username | ||
+ | | 235G |./ | ||
+ | | 663G |./ | ||
+ | | 771G |./ | ||
+ | | 87G |./ | ||
+ | | 107G |./ | ||
+ | | 72G |./ | ||
+ | | 134G |./ | ||
+ | | 87G |./ | ||
+ | | 455G |./ | ||
+ | | 76G |./ | ||
+ | |||
+ | ===== Musings ===== | ||
+ | \\ | ||
+ | * 1,230G & 2,815G & 4,045G | ||
+ | \\ | ||
+ | Totals of columns D, C and B from the table above.
+ | \\ | ||
+ | * TSM total number of files for the cluster grew from 4,724,563 to 18,020,855 in the last 3 months.
+ | \\ | ||
+ | Yes, it appears we have a lot of files. | ||
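To put that file count in perspective, here is a rough back-of-the-envelope sketch. The two counts are the ones quoted above; treating "3 months" as 90 days is an assumption made only for this estimate.

```python
# TSM file counts quoted above
files_before = 4_724_563
files_after = 18_020_855

# Assumption for this estimate only: "3 months" is taken as 90 days
days = 90

new_files = files_after - files_before
growth_factor = files_after / files_before
files_per_day = new_files / days

print(f"new files:      {new_files:,}")
print(f"growth factor:  {growth_factor:.2f}x")
print(f"files per day:  {files_per_day:,.0f}")
```

Under that assumption the cluster gained well over a hundred thousand backed-up files per day on average.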
+ | \\ | ||
+ | * TSM total volume for the cluster grew from 1.6 to 2.5 TB
+ | \\ | ||
+ | The linux head node reports 2.8 TB of data on disk. That would be roughly 1.4 TB of compressed TSM data. Wow. So 2.5 minus 1.4 leaves 1.1 TB of compressed volume for the inactive and deleted versions of files. That would be the equivalent of 2.2 TB of data if on disk! Almost matches, pointing at a tremendous modification and/or deletion rate of files. | ||
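The estimate above can be written out step by step. This is a sketch of the same arithmetic, assuming the 2x compression rule of thumb from the facts list holds uniformly; the variable names are illustrative.

```python
on_disk_tb = 2.8         # data on disk as reported by the linux head node
tsm_total_tb = 2.5       # total (compressed) TSM volume for the cluster
compression_ratio = 2.0  # assumption: the "multiply by 2" rule of thumb

# Compressed size of the data that is currently on disk
active_compressed_tb = on_disk_tb / compression_ratio

# Whatever is left in TSM must be inactive and deleted file versions
inactive_compressed_tb = tsm_total_tb - active_compressed_tb

# The same inactive data expressed as uncompressed, on-disk volume
inactive_on_disk_tb = inactive_compressed_tb * compression_ratio

print(f"active, compressed:    {active_compressed_tb:.1f} TB")
print(f"inactive, compressed:  {inactive_compressed_tb:.1f} TB")
print(f"inactive, if on disk:  {inactive_on_disk_tb:.1f} TB")
```

The result is only as good as the compression assumption; a different ratio shifts all three numbers.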
+ | \\ | ||
+ | **[[cluster: |