Differences

This shows you the differences between two versions of the page.

--- cluster:194 [2020/07/28 17:35]
hmeij07 [HDD]
+++ cluster:194 [2021/03/09 14:47]
hmeij07 [Update]
@@ Line 115: / Line 115: @@
 Virtual IP ''hpcstore.wesleyan.edu'' floats back and forth seamlessly (tested, some protocols will lose connectivity). In a split brain situation (no response, both controllers think they are **it**), disconnect one controller from power then reboot.  Then reconnect and wait a few minutes for HA icon to turn green when controller comes online.
-An update goes like this and is not an interruption. Check for and apply updates. They are applied to partner and partner is rebooted. When partner comes back up it becomes the primary. Now you need apply updates to other partner. When it comes back up it remains secondary node.  Check that nodes run the same version.
+Network interfaces marked Critical for Failover: IGB0 and IGB1 (/zfshomes via NFS) and lagg0 (vlan52).
+You can always Disable Failover, to fix the power feed of the switches on 192.168.0.0/16 or 10.10.0.0/16.
+Check Box to Disable Failover
+Go to WebUI > System > Failover > Click the Box > Then Click Save (leave default controller setting as is)
+This will allow you to make your network change without failing over.  Sync to Peer is probably not necessary since you are on the Active controller. Once you are finished, do Sync to Peer with Failover enabled (no standby reboot).
 ==== SSH ====
@@ Line 350: / Line 358: @@
 </code>
-==== Update ====
+==== Update 11 ====
-Change the Train to 11.3, then you will apply the update first in the WebUI to the passive controller.
+See Update 12 for manual update to v12 with Anthony on 03.09.2021
+**Change the Train** to 11.3, then you will apply the update first in the WebUI to the passive controller.
 After it reboots, you will fail over to it by **rebooting** the Active controller (the **current** WebUI).
@@ Line 364: / Line 374: @@
 Enable HA, click icon
-Apply Pending Updates
+**Apply Pending Updates**
 Upgrades both controllers. Files are downloaded to the Active Controller and then transferred to the Standby Controller. The upgrade process starts concurrently on both TrueNAS Controllers.
-Server responds while HA disabled.  Fail over just take 5 seconds. Update takes 15 mins. Log out and log back in once the passive standby is on new update.
+Server responds while HA is disabled.  You are instructed to Initiate Fail Over; do so, it just takes 5 seconds. Then Continue with pending upgrade ... wait 5 mins or so, watch console activity.  **THEN** log out and log back in once the passive standby is on the new update.
+Update takes 15 mins in total.
+** 11.3 U5 **
+  * Check for Updates, read release Notes, schedule support ticket if needed (major update)
+  * Apply Pending update, save configuration, **confirm box** check for it!
+  * active/standby both download and install
+  * At 100% standby reboots, HA disables, file system ok
+  * Check version on standby, Initiate Fail Over (interrupts file system)
+  * Login, you end up on the updated, now active server
+  * Logout/login, Pending Update, Continue
+  * Wait for HA to be enabled, check version on new standby
 ==== HDD ====
@@ Line 404: / Line 426: @@
 You can look that information up yourself by opening an SSH session to the passive controller, navigating to the /root/syslog directory and examining the files. The "controller_{a,b}" file shows the output for today. Extract the controller_a.0.bz2 file and read the output of the resulting controller_a.0 file to see the output for yesterday. controller_a.1 would contain the output for the day before yesterday, and so on.
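A minimal sketch of the lookup described above, in shell. The helper names ''controller_log_path'' and ''show_controller_log'' are hypothetical, and it assumes every rotated file follows the ''controller_a.0.bz2'' naming shown for yesterday (older days compressed the same way); verify against the actual contents of /root/syslog before relying on it.

```shell
# Hypothetical helper: map "days ago" to the file name pattern described above.
# 0 = today (controller_a), 1 = yesterday (controller_a.0.bz2), and so on.
# Assumption: all rotated files carry the .bz2 suffix.
controller_log_path() {
  ctrl=$1            # controller letter: a or b
  days_ago=${2:-0}   # 0 = today
  dir=${3:-/root/syslog}
  if [ "$days_ago" -eq 0 ]; then
    printf '%s/controller_%s\n' "$dir" "$ctrl"
  else
    printf '%s/controller_%s.%d.bz2\n' "$dir" "$ctrl" "$((days_ago - 1))"
  fi
}

# Print the log without leaving an extracted copy behind.
show_controller_log() {
  path=$(controller_log_path "$@")
  case $path in
    *.bz2) bzcat "$path" ;;
    *)     cat "$path" ;;
  esac
}
```

For example, ''show_controller_log a 1 | less'' would page through yesterday's output for controller a.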
+==== Split Brain ====
+When you end up in an error failover state, try a console shutdown first. If that does not work, cut power to the controllers.  Power down the disk array, wait 10 mins, power up, wait 10 mins.  Slide one controller out an inch or so.  Power up the other controller, which will become the active controller. Wait 10 mins, log in, look around. Slide in the other controller and restore redundant power for both controllers.  Wait until HA is enabled. This is how you get out of a split-brain situation.
+==== fndebug ====
+  * first log into support, then download teamviewer
+  * https://support.ixsystems.com/index.php?
+  * get.teamviewer.com/ixsystems
+**Manual debug file creation**, then ftp to ftp.ixsystems.com
+<code>
+freenas-debug -A
+tar czvf fndebug-wesleyan-20201123.tar.gz /var/tmp/fndebug
+# next look at bottom of fndebug/SMART/dump.txt
+/dev/da10 HGST:7200:HUS728T8TAL4201:VAKM187L C:30 dR:2 dW:2503 dL:55 uR:0 uW:0 SMART Status:OK **!!!**
+/dev/da9 HGST:7200:HUS728T8TAL4201:VAKL26ML C:30 dR:3 dW:0 dL:0 uR:0 uW:39 SMART Status:OK **!!!**
+# these drives have not failed yet but have write errors, offline/replace, see below
+# next look at output of zpool status -x in fndebug/ZFS/dump.txt
+# and the error code
+# https://illumos.org/msg/ZFS-8000-9P
+        NAME                                              STATE     READ WRITE CKSUM
+        tank                                              DEGRADED     0     0     0
+...
+          raidz2-1                                        DEGRADED     0     0     0
+            gptid/16250571-211a-11ea-bbd5-b496915e40c8    DEGRADED     0     0 1.09K  too many errors
+# look for checksums that have failed like this disk in vdev raidz2-1
+# clean up the spare that resilvered (INUSE status)
+# then run a clear on the pool. Then we'll try to get another debug.
+zpool detach tank gptid/173e4974-211a-11ea-bbd5-b496915e40c8
+zpool clear tank
+# that brought all drives back online and the vdevs show
+# then via gui added the available drive back as spare
+</code>
+  * Monitor the progress of the resilvering operation: 'zpool status -x'
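The SMART summary lines in the debug listing above can be scanned automatically rather than by eye. This is a rough sketch, not part of the vendor tooling: ''flag_smart_errors'' is a hypothetical helper, and treating every one of the dR/dW/dL/uR/uW fields as an error counter is an assumption based only on the sample lines shown.

```shell
# Hypothetical helper: list devices whose summary line has a non-zero
# dR, dW, dL, uR or uW field. Reads the named files, or stdin if none given.
# Assumption: these fields are error counters and 0 means "clean".
flag_smart_errors() {
  awk '
    /^\/dev\// {
      bad = ""
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^(dR|dW|dL|uR|uW):[0-9]+$/) {
          split($i, kv, ":")
          if (kv[2] + 0 > 0) bad = bad " " $i
        }
      }
      # Print the device followed by only the non-zero counters.
      if (bad != "") print $1 ":" bad
    }
  ' "$@"
}
```

Run as ''flag_smart_errors fndebug/SMART/dump.txt''; a drive with all counters at zero produces no output.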
+**Replace a failed drive**
+  * https://www.ixsystems.com/documentation/truenas/11.3-U5/storage.html#replacing-a-failed-disk
+  * drives mentioned above have not failed yet so we must "offline" them first
+<code>
+1) Go into the Storage > Pools page. Click the Gear icon next to the pool and press the "Status" option.
+2) Find da4 and press the three-dot options button next to it, then press "Offline".
+3) Go to the System > View Enclosure page, select da4 and press "Identify" to light up the drive on the rack.
+4) Physically swap the drive on the rack with its replacement.
+5) Go back to the Storage > Pool > Status page, bring up the options for the removed drive,
+5a) Select member disk from dropdown, and press "Replace". Success popup, click Close.
+The replacement drive may or may not have been given the name "da4".
+6) Wait for the drive to finish resilvering before proceeding to replace da3.
+6a) Click spinning icon to view progress. Pool status "healthy" while resilvering.
+Return the drives in original box, return label provided.
+</code>
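Waiting for one resilver to finish before swapping the next drive can also be done from an SSH session. A small sketch, assuming the pool is named ''tank'' as in the status output and that ''zpool status'' reports a running resilver with the phrase "resilver in progress"; the function names and the 300 second polling interval are illustrative choices.

```shell
# Succeeds (exit 0) while the `zpool status` text on stdin reports a running resilver.
resilver_in_progress() {
  grep -q 'resilver in progress'
}

# Poll the pool until the resilver is done, then print the health summary.
# Pool name and interval are assumptions; adjust for your system.
wait_for_resilver() {
  pool=${1:-tank}
  interval=${2:-300}
  while zpool status "$pool" | resilver_in_progress; do
    sleep "$interval"
  done
  zpool status -x "$pool"
}
```

Calling ''wait_for_resilver tank'' blocks until the resilver completes, so the next drive is only replaced after the pool has regained its redundancy.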
