cluster:194 [2020/10/26 11:43] hmeij07 [Update]
cluster:194 [2023/01/09 14:38] hmeij07 [certs]
Virtual IP ''
Critical

You can always Disable Failover, to fix power feed of switches 192.168.0.0/16 or 10.10.0.0/

Check the box to Disable Failover:
Go to WebUI > System > Failover > click the box > then click Save (leave the default controller setting as is)

This will allow you to make your network change without failing over.
==== SSH ====
</code>

==== certs ====

  * Go to System > CA and certs
  * Add a cert
  * name is "
  * FQDN for hpstore.wesleyan.edu,
  * fill in just the basics, no constraints or advanced settings
  * Add, then view CSR section, copy and provide to InCommon admin
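The CSR handed to the InCommon admin can also be generated and inspected from a shell with openssl; a minimal sketch (file names and the FQDN are assumptions based on the notes above, the WebUI flow produces the CSR on the appliance itself):

```shell
# Generate a private key and CSR locally (names/FQDN are assumptions)
openssl req -new -newkey rsa:2048 -nodes \
  -keyout hpcstore.key -out hpcstore.csr \
  -subj "/CN=hpcstore.wesleyan.edu"

# Inspect the CSR before providing it to the InCommon admin
openssl req -in hpcstore.csr -noout -subject
```

The ''-subj'' flag skips the interactive prompts; a real request would also carry subjectAltName entries per InCommon requirements.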
==== ZFS ====
zfs userspace
zfs groupspace tank/

# utterly bizarre: in v12 these commands change

root@hpcstore2[~]#
hpcstore2%
hpcstore2% zfs get userused@hmeij07 tank/
NAME
tank/
hpcstore2% zfs get userquota@hmeij07 tank/
NAME
tank/
hpcstore2% zfs get userspace@hmeij07 tank/
bad property list: invalid property '
# hpc100
==== Snapshots ====

Snapshots are made easier in new releases ... traverse to the hidden directory ''/

<code>

129.133.52.245:/
251T

</code>

  * Daily snapshots, one per day, kept for a year (for now)
  * check permissions on cloned volume, not windows!
  * NOTE: once had mnt/
  * when cloning grant access to <del>192.168.0.0/</del>
  * NFS mount, read only
  * maproot ''
  * Clone mounted on say ''
  * Restore actions by user
  * Delete clone when done
</code>
==== Update ====

See Update 12 for manual update to v12 with Anthony on 03.09.2021

**Change the Train** to 11.3, then you will apply the update first in the WebUI to the passive controller.
  * Check for Updates, read release Notes, schedule support ticket if needed (major update)
  * Apply Pending update, save configuration,
  * active/
  * At 100% standby reboots, HA disables, file system ok
  * Check version on standby, Initiate Fail Over (interrupts file system)
  * Login, you end up on new active server, then
  * Logout/
  * when old active server boots "
  * check version on new standby

==== HDD ====
You can look that information up yourself by opening an SSH session to the passive controller, navigating to the /

==== Split Brain ====

When ending up in an error fail over state, try a console shutdown first. If that does not work, cut power to the controllers.
==== fndebug ====

  * first log into support, then download teamviewer
  * https://
  * get.teamviewer.com/

**Manual debug file creation**, then ftp to ftp.ixsystems.com

<code>

freenas-debug -A
tar czvf fndebug-wesleyan-20201123.tar.gz /

# next look at bottom of fndebug/
/dev/da10 HGST:
/dev/da9 HGST:
# these drives have not failed yet but have write errors, offline/

# next look at output of zpool status -x in fndebug/
# and the error code
# https://

NAME        STATE     READ WRITE CKSUM
tank        DEGRADED
...
  raidz2-1
    gptid/
# look for checksums that have failed like this disk in vdev raidz2-1

# clean up the spare that resilvered (INUSE status)
# then run a clear on the pool. Then we'll try to get another debug.

zpool detach tank gptid/
zpool clear tank

# that brought all drives back online and the vdevs show
# then via gui added the available drive back as spare

</code>
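The manual bundle step above is just an archive of the debug output directory; a sketch with a stand-in directory (the real output path on the controller is truncated above, so the local ''fndebug/'' directory here is an assumption):

```shell
# Stand-in for the freenas-debug output directory (real path is an assumption)
mkdir -p fndebug
echo "demo debug output" > fndebug/dump.txt

# Date-stamped bundle, named like the example above, ready for ftp.ixsystems.com
tar czvf "fndebug-wesleyan-$(date +%Y%m%d).tar.gz" fndebug/

# Verify the archive contents before uploading
tar tzf fndebug-wesleyan-*.tar.gz
```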

  * Monitor the progress of the resilvering operation: 'zpool status -x'

**Replace a failed drive**

  * https://
  * drives mentioned above have not failed yet so we must "

<code>

1) Go into the Storage > Pools page. Click the Gear icon next to the pool and press the "
2) Find da4 and press the three-dot options button next to it, then press "
3) Go to the System > View Enclosure page, select da4 and press "
4) Physically swap the drive on the rack with its replacement.
5) Go back to the Storage > Pool > Status page, bring up the options for the removed drive,
5a) Select member disk from dropdown, and press "
The replacement drive may or may not have been given the name "
6) Wait for the drive to finish resilvering before proceeding to replace da3.
6a) Click spinning icon to view progress. Pool status "
Return the drives in original box, return label provided.

</code>

** Pool Unhealthy but not Degraded status **

No failed disks, no deploy of spare, but pool unhealthy.

<code>

Mar 21 04:03:57 hpcstore2 (da11:
Mar 21 04:03:57 hpcstore2 (da11:
Mar 21 04:03:57 hpcstore2 (da11:
Mar 21 04:03:57 hpcstore2 (da11:
Mar 21 04:03:57 hpcstore2 (da11:
Mar 21 04:03:57 hpcstore2 (da11:

1) Storage > Pools. Click gear icon next to the pool and press the "
2) Find da11 and press the three-dot options button next to it, then press "
3) System > View Enclosure, find & select da11, press "
4) Physically swap the drive on the rack with its replacement.
5) Storage > Pool > Status page, bring up three-dot options for the removed drive,
5a) Select member disk from drop down, and press "
6) Wait till resilver finishes.

</code>

==== Console hangs ====

12.7

As for the issue of the "

service middlewared stop\\
service middlewared start

"

==== Update 12 ====

System > Update > Select (new train 12.0-STABLE)

** Open a console on both controllers without double ssh sessions, directly to hpcstore1/

''
''

Then download updates on passive, check version ''

''

''

...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%

reboot passive

from active ping passive heartbeat IP, when up

check version passive

check boot env ''

on passive ''

now force fail over via GUI (interruptive for 60 seconds)

Anthony did a reboot on active instead, watch log for personality swap

then update the new passive

''

''

then check version, reboot new passive, check version, become new standby

Result: personality switch active vs standby, took 35 mins

In two months: ZFS feature updates patch, not interruptive,
Upgrade done
 --- //

Storage > Pool > "

** 12.0-U4.1 **

  * ditto above, see major release upgrade below
  * but old active did not come up, reset controller
  * click on "
  * hmm something about failed to connect failoverscratchdisk?

** 12.0-U5.1 **

  * standby reboot 5 mins
  * fail over 1 min
  * new standby "apply pending updates"
  * this version went fine

__Not created/

While the underlying issues have been fixed, this setting continues to be disabled by default for additional performance investigation. To manually reactivate persistent L2ARC, log in to the TrueNAS Web Interface, go to System > Tunables, and add a new tunable with these values:
<code>
Type = sysctl
Variable = vfs.zfs.l2arc.rebuild_enabled
Value = 1
</code>

From support: In an HA environment,

** 12.0-U6 **

  * same as 5.1, went fine,
  * new standby reboot 5 mins

** 12.0-U6.1 **

  * same as 6, went fine,
  * little flakiness on failover, apply pending appeared twice
  * let it go 10 mins, use ping hostname to test
  * new standby reboot 5 mins

** 12.0-U7 **

  * major OpenZFS update
  * same as update 12.0
  * no problems
  * cpu was unusually busy before upgrade
  * terminated some rsyncs

** 12.0-U8 **

  * 02/23/2022
  * no problems

** 12.0-U8.1 **

  * 05/03/2022
  * failover success at 10 mins
  * then no Pending box, just a Continue button
  * watch console messages, at 17 mins HA enabled

==== Update 13 ====

System > Update > Select (new train 13.0-STABLE)

<code>

# in shell



...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%

beadm list
# (Active N = 12.0-U8.1 and R = 13.0-U3.1)

</code>

once both have finished, reboot passive, web gui log back in

once passive back up, reboot active

web gui log back into new active, wait for HA to be enabled

debug plus screenshots for snapshot visibility which is visible (working in 13.0-U3.1) but database setting is still invisible

took less than 35 mins

<code>

bstop 0
bresume 0
# manual, one at a time
scontrol hold joblist
# one at a time
# for i in `squeue | grep '
# then grep '
scontrol suspend joblist
scontrol resume
scontrol release joblist

</code>
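The commented-out loop above suggests scripting the one-at-a-time holds; a dry-run sketch (the job-listing command and state filter are assumptions, shown here with ''echo'' stand-ins so nothing is actually held):

```shell
# Hold jobs one at a time; $1 is a command that prints job ids,
# e.g. "squeue -h -t PD -o %i" on a real Slurm cluster (an assumption)
hold_jobs() {
  for i in $($1); do
    echo "scontrol hold $i"   # drop the echo to actually hold each job
  done
}

# Dry run with stand-in job ids
hold_jobs "echo 101 102 103"
```

The same wrapper works for ''scontrol suspend''/''resume''/''release'' by swapping the subcommand.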