\\
**[[cluster:0|Back]]**
===== XFS panic =====
We run a lot of XFS storage arrays on hardware RAID controllers (Areca, MegaRAID). Rsync is used to pull content from the active server to the standby server in a continuous loop.
Usually something like this happens: a disk fails, a hot spare deploys, the array rebuilds parity, the failed disk gets replaced, a new hot spare is created. All is well.
Sometimes, and I have noticed a pattern, not all is well: a volume check is in progress, a disk fails, the volume check is aborted, a hot spare deploys, the array starts rebuilding, XFS panics and takes the file system offline.
I am not sure why, and there are few details on the web. Steps to recover from this situation follow below, along with some guidance on how to deal with detached/corrupted files.
** First ** stop all replication engines or cron jobs that pull content from the active server and back it up to the standby server. We want nothing to change on the standby server while fixing the active server.
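A minimal sketch of what that looks like, assuming the pull loop runs either from root's crontab or as a systemd service; the service name ''content-sync'' is hypothetical.
# see which cron entries drive the pull loop, then comment them out
crontab -l | grep -n rsync
crontab -e
# or, if the loop runs as a (hypothetical) systemd service named content-sync, stop it for now
systemctl stop content-sync.service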
** Next ** start a volume check with the array controller on the active server, which will identify any bad blocks that should not be used from now on. Observed completion times range from 8 to 80 hours, with reported error counts between 0 and 70 million.
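For a MegaRAID controller the check can be started from the command line; the binary name and flags below are MegaCli's consistency-check commands as I remember them, so verify them against your controller's documentation (Areca controllers use their own ''cli64'' or web interface instead).
# start a consistency (volume) check on all logical drives, all adapters
MegaCli64 -LDCC -Start -LAll -aAll
# watch progress; this is the 8-to-80-hour wait mentioned above
MegaCli64 -LDCC -ShowProg -LAll -aAll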
** Next ** run ''yum update -y'' on the active server. Comment out the array device mount in ''/etc/fstab'' and reboot. The reboot will confirm that the RAID controller finds the storage array and that everything is in a normal state hardware-wise.
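A sketch of the fstab edit, assuming the array is the ''/dev/sdb1'' entry (adjust the device to match your system).
# keep a backup, then comment out the array mount line so the reboot comes up without it
cp /etc/fstab /etc/fstab.bak
sed -i 's|^/dev/sdb1|#/dev/sdb1|' /etc/fstab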
** Next ** mount the storage manually. This may take different forms, so try the options below in order (a sketch of the whole sequence follows the list). Repeat the previous steps for the standby server once the recovery process is done.
* ''mount /dev/sdb1 /mindstore'' (give the mount point explicitly, since the fstab entry is commented out); if that works, check ''dmesg'' for a clean mount message and you are good to go: the journal log of changes has been applied successfully
* if that fails with a "structure needs cleaning" message, unmount the device and try ''xfs_repair -n /dev/sdb1'' (''-n'' stands for no modifications); if that finishes, redo it without the ''-n'' flag
* if that fails, we are definitely going to lose content; run the repair again while zeroing out the journal log: ''xfs_repair -L /dev/sdb1''
The repair operation will relocate all corrupt and detached files (including directories) to a directory called ''lost+found'' (it will be created). Within this directory, paths are lost but some metadata is preserved. See the section below on how to deal with that. Major pain.
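For reference, a minimal sketch of the sequence above, assuming the array is ''/dev/sdb1'' and the mount point is ''/mindstore'' (adjust both to your layout); run the steps by hand and read ''dmesg'' between them rather than scripting this blindly.
mount /dev/sdb1 /mindstore     # 1) plain mount; if dmesg shows a clean mount, stop here
umount /mindstore              #    only if you need to fall through to a repair
xfs_repair -n /dev/sdb1        # 2) dry run (-n = no modifications)
xfs_repair /dev/sdb1           #    if the dry run completes, repair for real
xfs_repair -L /dev/sdb1        # 3) last resort: zero the journal log; content will be lost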
** Next ** mount the storage
** Next ** on both the active and standby servers run ''du -hs * | tee du.txt'' and observe size differences between directories. Remember this only reflects changes since the last backup cycle, but it gives a sense of which directories were most impacted.
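A quick way to line the two listings up, assuming the standby host is reachable over ssh as ''standby'' and ''du.txt'' was written at the top of the storage area on both machines (both are assumptions).
# show only the directories whose sizes differ between active and standby
diff <(ssh standby cat /mindstore/du.txt) /mindstore/du.txt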
** Next ** run a refresh __from__ the standby server to the active server. We must perform this operation so that when we restart replication we do not clobber anything on the active server with rsync's ''--delete'' flag (an optional dry run is shown after the command).
* ''rsync -vac --whole-file --stats /path/to/standby/dir/ active:/path/to/active/dir/''
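If in doubt, rsync's ''--dry-run'' flag will preview the refresh without transferring anything, for example:
# same command as above with --dry-run added; nothing is copied yet
rsync -vac --whole-file --stats --dry-run /path/to/standby/dir/ active:/path/to/active/dir/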
** Next ** unmount and remount the storage so the journal log is applied
===== lost+found =====
** First ** create a listing of all files found
* ''find /mindstore/lost+found -type f -ls | tee /mindstore/lost+found.listing'' (the ''-ls'' output includes the owner of each file and places the path in field 11, which the commands below rely on)
The files are named after their inode numbers and are detached from their original paths. Any restoration is now a manual, tedious project. It consists of two steps (a quick per-owner overview follows the list):
- find directories by username and identify files
- make sure files are not corrupted, using the ''file'' command
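Before diving into a single user, it helps to see how much landed in ''lost+found'' per owner. A minimal sketch, assuming the listing was created with ''find -ls'' as above (the owner is field 5):
# count detached/corrupt files per owner, largest first
awk '{print $5}' /mindstore/lost+found.listing | sort | uniq -c | sort -rn | head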
We'll use the username ''yezzyat'' to show an example.
# get user set of files
grep yezzyat /mindstore/lost+found.listing > /tmp/foo.log
# individual files have a count of 1
awk '{print $11}' /tmp/foo.log |awk -F\/ '{print $4}' | sort -n | uniq -c | sort -n | head
# output
1 101005276254
1 101005276255
1 101005276256
1 101005276257
1 101005276258
1 101005276259
1 101005276260
1 101005276261
1 101005276262
1 101005276263
# test first file
file /mindstore/lost+found/101005276254
# output
/mindstore/lost+found/101005276254: DICOM medical imaging data
# another test
ls -l /mindstore/lost+found/101005276254
# output
-rwxr-xr-x 1 yezzyat psyc 66752 Sep 10 2016 /mindstore/lost+found/101005276254
# look at it with the "less" utility; even for a binary it may reveal some information, like
ORIGINAL\PRIMARY\V3_NYU_RAW
At this point you are ready to copy this file into your storage area.
**DO NOT BLINDLY COPY FILES TO YOUR AREA** there will be corrupted files which **SHOULD NOT** be copied.
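Once a file has passed the checks above, copy it back for the user with its metadata intact; the destination directory below is a hypothetical example.
# cp -a preserves owner, mode and timestamps (destination path is hypothetical)
mkdir -p /mindstore/psyc/yezzyat/restored
cp -a /mindstore/lost+found/101005276254 /mindstore/psyc/yezzyat/restored/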
Let's look at directories (count > 1)
awk '{print $11}' /tmp/foo.log |awk -F\/ '{print $4}' | sort -n | uniq -c | sort -n | tail
# output
5647 141871011888
5784 23668178350
8681 30256148259
10292 6534304103
10568 118181472697
15704 58087220750
16043 163276263379
17741 116024922424
18883 19388934039
20640 210500547885
# let's examine the last directory, with 20,640 files in it
file /mindstore/lost+found/210500547885
# output
/mindstore/lost+found/210500547885: directory
# more information
ls -ld /mindstore/lost+found/210500547885
# output
drwxr-xr-x 4 yezzyat psyc 63 Sep 10 2016 /mindstore/lost+found/210500547885
# anything in this directory and anything corrupt?
file /mindstore/lost+found/210500547885/*
# output
/mindstore/lost+found/210500547885/S07: directory
/mindstore/lost+found/210500547885/S07BV: directory
# again, anything corrupt in, say, S07BV? Beware of subdirectories! Test each file!
file /mindstore/lost+found/210500547885/S07BV/*
# output
/mindstore/lost+found/210500547885/S07BV/S07_Run1_Enc_reconcorrected_firstvol_as_anat.amr: data
/mindstore/lost+found/210500547885/S07BV/S07_Run1_Enc_reconcorrected_firstvol.fmr: ASCII text
/mindstore/lost+found/210500547885/S07BV/S07_Run1_Enc_reconcorrected_firstvol.stc: data
/mindstore/lost+found/210500547885/S07BV/S07_Run1_Enc_reconcorrected.fmr: ASCII text
/mindstore/lost+found/210500547885/S07BV/S07_Run1_Enc_reconcorrected.stc: data
# now you have more metadata to decide where you are going to copy this to
# also beware you might already have this content
# and the directory was flagged as corrupt/detached for a different reason
# *or* the restore from standby to active pulled the backup copy into place
# finally file has an option
-f, --files-from FILE read the filenames to be examined from FILE
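# hedged example of using it: extract just the path column from the per-user
# listing, classify everything in one pass, then summarize the detected types
awk '{print $11}' /tmp/foo.log > /tmp/foo.paths
file -f /tmp/foo.paths > /tmp/foo.types
awk -F': ' '{print $2}' /tmp/foo.types | sort | uniq -c | sort -rn | head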
That's it. Some steps can be automated, but deciding where to place the recovered files is up to the user. You may wish to make a ''$HOME/crash-date'' directory and just put the files/dirs in there.
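A sketch of that approach, using the example user from above; the directory name and paths are a convention, not a requirement.
DEST=/home/yezzyat/crash-$(date +%F)   # assumes /home/yezzyat is the user's home
mkdir -p "$DEST"
cp -a /mindstore/lost+found/210500547885 "$DEST"/   # copy a verified directory with metadata intact
chown -R yezzyat:psyc "$DEST"                        # hand ownership of the holding area to the user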
The ''lost+found'' directory is not used by the server and will eventually go away (for example, when space is needed or the next crash happens).
\\
**[[cluster:0|Back]]**