cluster:194 [2020/03/12 14:00] hmeij07 [ZFS]
cluster:194 [2021/02/01 15:44] hmeij07 [HA]
Notes. Mainly for me but might be useful/of interest to users.

Message:

Our current file server is sharptail.wesleyan.edu, which serves out home directories (/home, 10T). A new file server, hpcstore.wesleyan.edu, will be deployed to take over this function (/zfshomes, 190T). This notice is to inform you that your home directory has been cut over.

There are no changes for you. When you log into cottontail or cottontail2 you end up in your new home directory. $HOME and ~username work as usual. The only difference is that your old home was at /

If you wish to load/unload large content from your new home directory, please log into hpcstore.wesleyan.edu directly (via ssh/sftp) or, preferably, use rsync with a bandwidth throttle no larger than "

Details at\\
https://

==== Summary ====

  * **SSH** (sftp/scp)

<code>

# from outside via VPN
$ ssh hpc21@hpcstore.wesleyan.edu

hpc21@hpcstore.wesleyan.edu's password:
FreeBSD 11.2-STABLE (TrueNAS.amd64)
(banner snip ...)
Welcome to TrueNAS

# note we ended up on node "hpcstore2"
[hpc21@hpcstore2 ~]$ pwd
/
[hpc21@hpcstore2 ~]$ echo $HOME
/

# quota check
[hpc21@hpcstore2 ~]$ zfs userspace tank/
TYPE        NAME   USED  QUOTA
POSIX User  hpc21


# from inside the HPCC with ssh keys properly set up
[hpc21@cottontail ~]$ ssh hpcstore
Last login: Mon Mar 23 10:58:27 2020 from 129.133.52.222

[hpc21@cottontail ~]$ echo $HOME
/

[hpc21@hpcstore2 ~]$ df -h .
Filesystem
tank/

</code>

  * **RSYNC**

<code>

[hmeij@ThisPC]$ rsync -vac --dry-run --whole-file --bwlimit=4096 c:
sending incremental file list
...

</code>

  * **SMB/CIFS**
    * all users have shares but not class accounts (hpc101-hpc200)

Not any more. There is a serious conflict between NFS and SMB ACLs if both protocols are enabled on the same dataset, so **nobody** has a samba share. If you want to drop&

--- //

<code>

# windows command line
C:
Enter the password for '
The command completed successfully.

# or ThisPC > Map Network Drive
\\hpcstore.wesleyan.edu\username
# user is hpcc username, password is hpcc password

</code>
==== Consoles ====
  * port 5
    * set up mac
    * plug in pin2usb cable (look for device /
    * launch terminal, invoke screen
    * screen /
    * sysadmin/
    * ifconfig eth0 | grep 'inet addr' or
    * ipmitool -H 127.0.0.1 -U admin -P admin lan print
    * ipmitool -H 127.0.0.1 -U admin -P admin lan set 1 ipaddr ... (etc + netmask + defgw)
    * to set initial ips/
  * port 10 (if the 12 option netcli boot menu does not show)
    * unplug console cable, plug in pin2serial cable
    * set up windows laptop, launch hyperterminal,
    * 12 menu ''
  * port 80->443, web site (with shell of ''
    * gui
    * shell
    * all non-zfs commands are persistent across boots
      * except ssh keys and directory permissions
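The two ''ipmitool'' steps above can be sketched as a dry-run script that only prints the commands rather than executing them; the 192.168.0.x values below are placeholder assumptions, not the servers' actual BMC addresses.

```shell
#!/bin/sh
# Dry-run sketch of the BMC LAN setup steps above: print each
# ipmitool invocation instead of running it. Addresses are
# placeholders (assumptions), not the real BMC values.
IPMI='ipmitool -H 127.0.0.1 -U admin -P admin'
for args in 'lan print 1' \
            'lan set 1 ipaddr 192.168.0.50' \
            'lan set 1 netmask 255.255.255.0' \
            'lan set 1 defgw ipaddr 192.168.0.1'; do
  echo "$IPMI $args"
done
```

Drop the ''echo'' to actually apply the settings from the console session.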
==== HA ====
Virtual IP ''

Critical for Failover: Network Interfaces marked for IGB0 and IGB1 (/zfshomes via NFS) and lagg0 (vlan52)

You can always Disable Failover, to fix power feed of switches 192.168.0.0/

Check Box to Disable Failover\\
Go to WebUI > System > Failover > Click the Box > Then Click Save

This will allow you to make your network change without failing over. Then, when finished, enable Failover again.
==== SSH ====
Allowed for large content transfers using ''
TODO: rsync?
Home directories are located in ''/
TODO: write script.
TODO: add disksold sharptail:/
TODO: backup target\\
<code>
# create user, no new but set primary
# set shell, set permissions, some random passwd date +%N with symbols
# then move all dot files into ~/
# copy content over from sharptail, @hpcstore...
rsync -ac --bwlimit=4096 --whole-file --stats sharptail:/
# SSH keys in place so should be passwordless, test
ssh username@hpcstore.wesleyan.edu
</code>
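The account-setup notes above mention setting "some random passwd date +%N with symbols". The site's actual script is not shown; a minimal hypothetical sketch of that one step could be:

```shell
#!/bin/sh
# Hypothetical sketch (not the site's actual script): derive a
# throwaway password from the nanosecond counter, hashed, with
# two symbols bolted on. The "!#" suffix is an assumption.
ns=$(date +%N)                                  # nanosecond counter, varies per call
pw="$(printf '%s' "$ns" | md5sum | cut -c1-12)!#"
echo "$pw"                                      # 12 hex chars + "!#"
```

The generated password is only a placeholder until the user asks for a reset, per the SMB note below.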
==== ZFS ====

  * https://

<code>
tank/

# health
zpool status -v tank
  pool: tank
  scan: scrub repaired 0 in 0 days 00:00:02 with 0 errors on Sun Feb 2 03:00:04 2020
config:

        NAME       STATE   READ WRITE CKSUM
        tank       ONLINE
          raidz2-0
            gptid/   (11 member disks, ids truncated)
          raidz2-1
            gptid/   (11 member disks, ids truncated)
          raidz2-2
            gptid/   (11 member disks, ids truncated)
        logs
          gptid/
        cache
          gptid/
        spares
          gptid/

errors: No known data errors
</code>

==== SMB ====

SMB/CIFS (Samba) shares are also created once the homedir is up. NOT!

  * do not mix SMB and NFS on the same dataset, not supported
  * problems '
  * windows ACLs on top of a unix file system = bad

<code>

# v that plus is the problem
drwxr-xr-x+ 147 root wheel 147 Apr 27 08:17 /

# either use the ACL editor to strip it off in v13.1-U2 or

setfacl -bn /

# followed by, for example

find /

# also unsupported via shell

</code>
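The truncated ''setfacl''/''find'' cleanup above can be sketched end to end. This is a hypothetical expansion, not the commands actually run on the server: the real dataset path is truncated in the notes, so a temporary directory stands in for it, and the ''setfacl -bn'' line (which strips all ACL entries on FreeBSD) is left commented out.

```shell
#!/bin/sh
# Hypothetical ACL/permission cleanup sketch. DATASET is a stand-in
# for the real (truncated) path; a temp dir keeps this safe to run.
DATASET=$(mktemp -d)
mkdir -p "$DATASET/user1/dir"
touch "$DATASET/user1/dir/file"
# setfacl -bn "$DATASET"                       # strip ACL entries (FreeBSD)
find "$DATASET" -type d -exec chmod 755 {} +   # dirs back to rwxr-xr-x
find "$DATASET" -type f -exec chmod 644 {} +   # files back to rw-r--r--
stat -c '%a' "$DATASET/user1/dir/file"         # prints 644 (GNU stat)
rm -rf "$DATASET"
```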

  * For each user
    * mnt/
    * uncheck default permissions
    * valid users = username, @ugroup(s)
    * ''/

**Note** At user creation a random password is set. Please ask to have it reset to access SMB shares. (There should be some **self-serve password reset** functionality with email confirmation, but I cannot find it for now.) Any passwords changed outside of the database will not be persistent across reboots.
<code>
</code>

----
Change $HOME location in ''/
**Note** remove access to old $HOME ... chown root:root + chmod o-rwx \\
END OF USER ACCOUNT SETUP
----

==== NFS ====

  * maproot is needed
  * export to both private networks

<code>

root@hpcstore1[~]#

/
/

/
-maproot="
/
-maproot="

</code>
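The export lines above are truncated by the wiki diff. For reference, a FreeBSD/TrueNAS ''/etc/exports'' entry combining ''maproot'' with a network restriction generally takes the following shape; the path, the root:wheel mapping, and both networks here are illustrative assumptions, not the server's actual values:

```
/mnt/tank/zfshomes -maproot="root":"wheel" -network 192.168.0.0/24
/mnt/tank/zfshomes -maproot="root":"wheel" -network 10.10.0.0/16
```

One line per network matches the "export to both private networks" bullet above.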
==== Rollback ====
/
</code>

==== Update ====

**Change the Train** to 11.3, then apply the update in the WebUI to the passive controller first.

After it reboots, fail over to it by **rebooting** the active controller (the **current** WebUI).

This will fail over to the updated 11.3-U2.1 controller (brief interruption).

From there, go to System > Update and do the same for the new passive controller.

After that, initiate failover back to the primary via the dashboard (brief interruption).

Enable HA, click icon.

**Apply Pending Updates**\\
Upgrades both controllers. Files are downloaded to the Active Controller and then transferred to the Standby Controller. The upgrade process starts concurrently on both TrueNAS Controllers.

The server responds while HA is disabled.

The update takes 15 minutes in total.

** 11.3 U5 **

  * Check for Updates, read release notes, schedule a support ticket if needed (major update)
  * Apply Pending Update, save configuration,
  * active/
  * At 100% standby reboots, HA disables, file system ok
  * Check version on standby, Initiate Fail Over (interrupts file system)
  * Login, you end up on the updated, now active server
  * Logout/
  * Wait for HA to be enabled, check version on new standby
==== HDD ====

Two types, hard to find in stock.

<code>

8T SAS
da0: <HGST HUS728T8TAL4201 B460> Fixed Direct Access SPC-4 SCSI device
da0: Serial Number VAKM5GTL
da0: 1200.000MB/
da0: Command Queueing enabled
da0: 7630885MB (1953506646 4096 byte sectors)
exxactcorp
https://

800G SSD
da2: <WDC WUSTR6480ASS201 B925> Fixed Direct Access SPC-5 SCSI device
da2: Serial Number V6V1XGDA
da2: Command Queueing enabled
da2: 763097MB (1562824368 512 byte sectors)
exxactcorp
https://

</code>

==== Logs ====

From support:

That information is logged via syslog for the opposite controller. For example, to find the information I did here, I looked in the syslog output on the controller that was passive at the time these alerts occurred.

You can look that information up yourself by opening an SSH session to the passive controller, navigating to the /

==== Split Brain ====

When you end up in an errored failover state, try a console shutdown first. If that does not work, cut power to the controllers.

==== fndebug ====

  * first log into support, then download teamviewer
    * https://
    * get.teamviewer.com/

**Manual debug file creation**, then ftp to ftp.ixsystems.com

<code>

freenas-debug -A
tar czvf fndebug-wesleyan-20201123.tar.gz /

# next look at the bottom of fndebug/
/dev/da10 HGST:
/dev/da9 HGST:
# these drives have not failed yet but have write errors, offline/

# next look at the output of zpool status -x in fndebug/
# and the error code
# https://

NAME STATE READ WRITE CKSUM
tank DEGRADED
...
raidz2-1
gptid/
# look for checksums that have failed, like this disk in vdev raidz2-1

# clean up the spare that resilvered (INUSE status)
# then run a clear on the pool; then we'll try to get another debug

zpool detach tank gptid/
zpool clear tank

# that brought all drives back online and the vdevs show
# then via the gui added the available drive back as spare

</code>

  * Monitor the progress of the resilvering operation: ''zpool status -x''
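Such a check can be scripted; since no pool is available here, the ''zpool status -x'' output is simulated with a fixed string (on the real server, replace the assignment with ''status=$(zpool status -x)''):

```shell
#!/bin/sh
# Sketch of a scripted pool health check. The status string is
# simulated; on the server use: status=$(zpool status -x)
status="all pools are healthy"
case "$status" in
  *healthy*)  echo "pool OK" ;;
  *DEGRADED*) echo "pool degraded, check vdevs and resilver progress" ;;
  *)          echo "unknown state: $status" ;;
esac
# → pool OK
```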

**Replace a failed drive**

  * https://
  * drives mentioned above have not failed yet so we must "

<code>

1) Go into the Storage > Pools page. Click the Gear icon next to the pool and press the "
2) Find da4 and press the three-dot options button next to it, then press "
3) Go to the System > View Enclosure page, select da4 and press "
4) Physically swap the drive on the rack with its replacement.
5) Go back to the Storage > Pool > Status page, bring up the options for the removed drive,
5a) Select the member disk from the dropdown, and press "
The replacement drive may or may not have been given the name "
6) Wait for the drive to finish resilvering before proceeding to replace da3.
6a) Click the spinning icon to view progress. Pool status "
Return the drives in the original box, return label provided.

</code>
\\