Notes. Mainly for me, but might be useful or of interest to users.

Message:

Our current file server is sharptail.wesleyan.edu, which serves out the home directories (/home, 10T). A new file server, hpcstore.wesleyan.edu, will be deployed to take over this function (/zfshomes, 190T). This notice is to inform you that your home directory has been cut over.

There are no changes for you. When you log into cottontail or cottontail2 you end up in your new home directory. $HOME and ~username work as usual. The only difference is that your old home was at /home/username and your new home is at /zfshomes/username.

If you wish to load/unload large content from your new home directory, please log into hpcstore.wesleyan.edu directly (via ssh/sftp) or, preferably, use rsync with a bandwidth throttle no larger than ''--bwlimit=4096''.

Details at\\
https://...

==== Summary ====

  * **SSH** (sftp/scp)

<code>

# from outside via VPN
$ ssh hpc21@hpcstore.wesleyan.edu

hpc21@hpcstore.wesleyan.edu's password:
FreeBSD 11.2-STABLE (TrueNAS.amd64)
(banner snip ...)
Welcome to TrueNAS

# note we ended up on node "hpcstore2"
[hpc21@hpcstore2 ~]$ pwd
/...
[hpc21@hpcstore2 ~]$ echo $HOME
/...

# quota check
[hpc21@hpcstore2 ~]$ zfs userspace tank/zfshomes
TYPE        NAME   USED  QUOTA
POSIX User  hpc21  ...


# from inside HPCC with ssh keys properly set up
[hpc21@cottontail ~]$ ssh hpcstore
Last login: Mon Mar 23 10:58:27 2020 from 129.133.52.222

[hpc21@cottontail ~]$ echo $HOME
/zfshomes/hpc21

[hpc21@hpcstore2 ~]$ df -h .
Filesystem     Size  Used  Avail  Capacity  Mounted on
tank/zfshomes  ...

</code>

  * **RSYNC**

<code>

[hmeij@ThisPC]$ rsync -vac --dry-run --whole-file --bwlimit=4096 c:...
sending incremental file list
...

</code>

  * **SMB/CIFS**
    * all users have shares but not class accounts (hpc101-hpc200)

Not any more. There is a serious conflict between NFS and SMB ACLs when both protocols are enabled on the same dataset, so **nobody** has a Samba share. If you want to drop&...

--- //...

<code>

# windows command line
C:...
Enter the password for '...
The command completed successfully.

# or ThisPC > Map Network Drive
\\hpcstore.wesleyan.edu\username
# user is hpcc username, password is hpcc password

</code>
==== Consoles ====

  * port 5
    * set up mac
    * plug in pin2usb cable (look for device /...)
    * launch terminal, invoke screen
    * screen /...
    * sysadmin/...
    * ifconfig eth0 | grep 'inet addr' or
    * ipmitool -H 127.0.0.1 -U admin -P admin lan print
    * ipmitool -H 127.0.0.1 -U admin -P admin lan set 1 ipaddr ... (etc + netmask + defgw)
    * to set initial ips/... (see the consolidated sketch after this list)
  * port 10 (if 12 option netcli boot menu does not show)
    * unplug console cable, plug in pin2serial cable
    * set up windows laptop, launch hyperterminal, ...
    * 12 menu ''...
  * port 80->443, web site (with shell of ''...)
    * gui
    * shell
      * all non-zfs commands are persistent across boots
      * except ssh keys and directory permissions
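
To tie the port-5 steps together, a consolidated sketch of the serial console and IPMI setup; the device path and IP values below are placeholders, not our actual settings:

<code>
# attach to the usb-serial console (device path is a placeholder)
screen /dev/tty.usbserial 9600

# show the current BMC network settings
ipmitool -H 127.0.0.1 -U admin -P admin lan print

# set a static address (placeholder values)
ipmitool -H 127.0.0.1 -U admin -P admin lan set 1 ipaddr 192.168.0.50
ipmitool -H 127.0.0.1 -U admin -P admin lan set 1 netmask 255.255.255.0
ipmitool -H 127.0.0.1 -U admin -P admin lan set 1 defgw ipaddr 192.168.0.1
</code>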
==== HA ====

Virtual IP ''...

Network interfaces marked Critical for Failover: igb0 and igb1 (/zfshomes via NFS) and lagg0 (vlan52).

You can always Disable Failover, for example to fix the power feed of the 192.168.0.0/... switches.

Check the box to Disable Failover:\\
Go to WebUI > System > Failover > click the box > then click Save (leave the default controller setting as is).

This allows you to make your network change without failing over. Sync to Peer is probably not necessary since you are on the Active controller. Once you are finished, sync with Failover Enabled (no standby reboot).
==== SSH ====

<code>
# create user, no new but set primary ...
# set shell, set permissions, ...
# then move all dot files into ~/._nas, scp ~/.ssh over
# copy content over from sharptail, @hpcstore...
rsync -ac --bwlimit=4096 --whole-file --stats ...
# SSH keys in place so should be passwordless, ...
ssh username@hpcstore.wesleyan.edu

...
zfs userspace tank/zfshomes
zfs groupspace tank/zfshomes

# utterly bizarre: in v12 these commands change

root@hpcstore2[~]# ...
hpcstore2% ...
hpcstore2% zfs get userused@hmeij07 tank/zfshomes
NAME           PROPERTY          VALUE  SOURCE
tank/zfshomes  userused@hmeij07  ...
hpcstore2% zfs get userquota@hmeij07 tank/zfshomes
NAME           PROPERTY           VALUE  SOURCE
tank/zfshomes  userquota@hmeij07  ...
hpcstore2% zfs get userspace@hmeij07 tank/zfshomes
bad property list: invalid property 'userspace@hmeij07'
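
# hypothetical v12 example (not from these notes): quotas are set with
# the same property syntax the get commands above use, e.g.
#   zfs set userquota@hmeij07=100G tank/zfshomes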
# hpc100

...
tank/...
# health
zpool status -v tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:02 with 0 errors on Sun Feb  2 03:00:04 2020
config:

        NAME          STATE   READ WRITE CKSUM
        tank          ONLINE  ...
          raidz2-0    ...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
          raidz2-1    ...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
          raidz2-2    ...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
            gptid/...
        logs
          gptid/...
        cache
          gptid/...
        spares
          gptid/...

errors: No known data errors
</code>
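
The scan line above comes from the scheduled scrub; one can also be started by hand with the standard ZFS commands (a reference sketch, not part of the original notes):

<code>
# kick off a manual scrub and watch its progress
zpool scrub tank
zpool status -v tank
</code>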

==== SMB ====

SMB/CIFS (Samba) shares are also created once the homedir is up. NOT!

  * do not mix SMB and NFS on the same dataset, not supported
  * problems '...
  * windows ACLs on top of a unix file system = bad

<code>

# v that trailing plus is the ACL, and it is the problem
drwxr-xr-x+ 147 root wheel 147 Apr 27 08:17 /...

# either use the ACL editor to strip it off in v13.1-U2 or

setfacl -bn /...

# followed by, for example

find /...

# also unsupported via shell

</code>
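
A minimal sketch of that recursive strip, assuming the dataset is mounted at /mnt/tank/zfshomes (a placeholder path); ''setfacl -bn'' is the standard FreeBSD way to drop ACL entries:

<code>
# strip ACL entries from the top-level directory ...
setfacl -bn /mnt/tank/zfshomes

# ... and from everything underneath it
find /mnt/tank/zfshomes -exec setfacl -bn {} \;
</code>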

  * For each user
    * mnt/...
    * uncheck default permissions
    * valid users = username, @ugroup(s)
    * ''/...

**Note** At user creation a random password is set. Please ask to have it reset in order to access SMB shares. (There should be some **self-serve password reset** functionality with email confirmation, but I cannot find it for now.) Any passwords changed outside of the database will not be persistent across boots.
<code>
...
</code>

----
Change $HOME location in ''/...''\\
**Note** remove access to the old $HOME ... chown root:root + chmod o-rwx\\
END OF USER ACCOUNT SETUP
----
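
A sketch of that lock-down step, assuming the old home lives under /home as described in the migration message above (username is a placeholder):

<code>
# make the old home directory inaccessible to its former owner
chown root:root /home/username
chmod o-rwx /home/username
</code>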


==== NFS ====

  * maproot is needed

...

  * check permissions on the cloned volume, not windows!
    * NOTE: once had mnt/...
  * when cloning grant access to <del>192.168.0.0/...</del>
  * NFS mount, read only
  * maproot ''...
  * Clone mounted on say ''...
  * Restore actions by user
  * Delete clone when done (see the sketch below)

<code>
...
</code>
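
A rough sketch of that restore flow using standard ZFS commands; the snapshot and clone names here are hypothetical:

<code>
# clone a snapshot of the homes dataset to a scratch dataset
zfs clone tank/zfshomes@auto-20200323 tank/restore

# share tank/restore read-only over NFS, let the user copy files back,
# then destroy the clone
zfs destroy tank/restore
</code>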

==== Update 11 ====

See Update 12 for the manual update to v12 with Anthony on 03.09.2021.

**Change the Train** to 11.3, then apply the update in the WebUI to the passive controller first.

After it reboots, fail over to it by **rebooting** the Active controller (the **current** WebUI).

This fails over to the updated 11.3-U2.1 controller (brief interruption).

From there, go to System > Update and do the same for the NEW passive controller.

After that, initiate failover back to the primary via the dashboard (brief interruption).

Enable HA, click icon.

**Apply Pending Updates**\\
Upgrades both controllers. Files are downloaded to the Active Controller and then transferred to the Standby Controller. The upgrade process starts concurrently on both TrueNAS Controllers.

The server responds while HA is disabled.

The update takes 15 mins in total.

** 11.3 U5 **

  * Check for Updates, read the release notes, schedule a support ticket if needed (major update)
  * Apply Pending update, save configuration, ...
  * active/...
  * At 100% the standby reboots, HA disables, file system ok
  * Check version on standby, Initiate Fail Over (interrupts file system)
  * Login, you end up on the new active server, then
  * Logout/...
  * when the old active server boots "...
  * check version on new standby


==== HDD ====

Two types, hard to find in stock.

<code>

...

8T SAS
da0: <HGST HUS728T8TAL4201 B460> Fixed Direct Access SPC-4 SCSI device
da0: Serial Number VAKM5GTL
da0: 1200.000MB/s transfers
da0: Command Queueing enabled
da0: 7630885MB (1953506646 4096 byte sectors)
exxactcorp
https://...

800G SSD
da2: <WDC WUSTR6480ASS201 B925> Fixed Direct Access SPC-5 SCSI device
da2: Serial Number V6V1XGDA
da2: Command Queueing enabled
da2: 763097MB (1562824368 512 byte sectors)
exxactcorp
https://...

</code>
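
The listings above look like FreeBSD boot/CAM output; such an inventory can be reproduced on demand with the standard camcontrol tool (for reference only):

<code>
# list attached SCSI/SAS devices and their da numbers
camcontrol devlist
</code>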

==== Logs ====

From support:

That information is logged via syslog on the opposite controller. For example, to find the information I did here, I looked in the syslog output on the controller that was passive at the time these alerts occurred.

You can look that information up yourself by opening an SSH session to the passive controller and navigating to the /...
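
A sketch of that lookup, assuming the standard FreeBSD syslog location and our controller hostnames (both assumptions):

<code>
# from the active controller, hop to the passive one
ssh hpcstore1

# then search the system log for the disk in question
grep da11 /var/log/messages
</code>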


==== Split Brain ====

When you end up in a failover error state, try a console shutdown first. If that does not work, cut power to the controllers.

==== fndebug ====

  * first log into support, then download teamviewer
    * https://...
    * get.teamviewer.com/...


**Manual debug file creation**, then ftp to ftp.ixsystems.com

<code>

freenas-debug -A
tar czvf fndebug-wesleyan-20201123.tar.gz /...

# next look at the bottom of fndebug/...
/dev/da10 HGST:...
/dev/da9 HGST:...
# these drives have not failed yet but have write errors, offline/...

# next look at the output of zpool status -x in fndebug/...
# and the error code
# https://...

NAME          STATE     READ WRITE CKSUM
tank          DEGRADED  ...
...
  raidz2-1    ...
    gptid/...
# look for checksums that have failed, like this disk in vdev raidz2-1

# clean up the spare that resilvered (INUSE status)
# then run a clear on the pool. Then we'll try to get another debug.

zpool detach tank gptid/...
zpool clear tank

# that brought all drives back online and the vdevs show ...
# then via gui added the available drive back as spare

</code>

  * Monitor the progress of the resilvering operation: ''zpool status -x''


**Replace a failed drive**

  * https://...
  * the drives mentioned above have not failed yet, so we must "...

<code>

1) Go into the Storage > Pools page. Click the Gear icon next to the pool and press the "Status" option.
2) Find da4 and press the three-dot options button next to it, then press "Offline".
3) Go to the System > View Enclosure page, select da4 and press "...
4) Physically swap the drive on the rack with its replacement.
5) Go back to the Storage > Pool > Status page, bring up the options for the removed drive,
5a) Select member disk from the dropdown, and press "Replace" ...
The replacement drive may or may not have been given the name "...
6) Wait for the drive to finish resilvering before proceeding to replace da3.
6a) Click the spinning icon to view progress. Pool status "...
Return the drives in the original box, return label provided.

</code>
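
Support's procedure is GUI-driven; for reference, a rough CLI equivalent using standard ZFS commands (a sketch of the same idea, not how it was actually done):

<code>
# take the failing disk out of service
zpool offline tank da4

# after physically swapping the drive, start the rebuild
zpool replace tank da4

# watch the resilver
zpool status -x
</code>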

** Pool Unhealthy but not Degraded status **

No failed disks, no deploy of a spare, but the pool is unhealthy.

<code>

Mar 21 04:03:57 hpcstore2 (da11:...
Mar 21 04:03:57 hpcstore2 (da11:...
Mar 21 04:03:57 hpcstore2 (da11:...
Mar 21 04:03:57 hpcstore2 (da11:...
Mar 21 04:03:57 hpcstore2 (da11:...
Mar 21 04:03:57 hpcstore2 (da11:...

1) Storage > Pools. Click the gear icon next to the pool and press the "Status" option.
2) Find da11 and press the three-dot options button next to it, then press "Offline".
3) System > View Enclosure, find&...
4) Physically swap the drive on the rack with its replacement.
5) Storage > Pool > Status page, bring up the three-dot options for the removed drive,
5a) Select member disk from the drop down, and press "Replace" ...
6) Wait till the resilver finishes.

</code>


==== Console hangs ====

12.7

As for the issue of the "...

service middlewared stop\\
service middlewared start

"...



==== Update 12 ====

System > Update > Select (new train 12.0-STABLE)

**Open a console on both controllers without double ssh sessions, directly to hpcstore1/...**

''...''\\
''...''

Then download updates on passive, check version ''...''

''...''

''...''

...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%

reboot passive

from active, ping the passive heartbeat IP, when up

check version on passive

check boot env ''...''

on passive ''...''

now force fail over via GUI (interruptive for 60 seconds)

Anthony did a reboot on active instead; watch the log for the personality swap

then update the new passive

''...''

''...''

then check version, reboot new passive, check version, becomes new standby

Result: personality switch active vs standby, took 35 mins

In two months: ZFS feature updates patch, not interruptive, ...\\
Upgrade done
--- //...
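
The exact commands were clipped above; by analogy with the Update 13 procedure below, they were presumably the train-targeted updater and a boot-environment check (an assumption, not a record of what was typed):

<code>
# on each controller in turn, by analogy with Update 13
freenas-update -v -T TrueNAS-12.0-STABLE update

# verify the new boot environment
beadm list
</code>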

Storage > Pool > "...

** 12.0-U4.1 **

  * ditto above, see major release upgrade below
  * but the old active did not come up, reset controller
  * click on "...
  * hmm, something about failed to connect failoverscratchdisk?

** 12.0-U5.1 **

  * standby reboot 5 mins
  * fail over 1 min
  * new standby "apply pending updates"
  * this version went fine

__Not created/...__\\
While the underlying issues have been fixed, this setting continues to be disabled by default for additional performance investigation. To manually reactivate persistent L2ARC, log in to the TrueNAS Web Interface, go to System > Tunables, and add a new tunable with these values:
<code>
Type = sysctl
Variable = vfs.zfs.l2arc.rebuild_enabled
Value = 1
</code>
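
To verify the tunable took effect after a reboot, the standard sysctl check (an assumption that this is how it would be confirmed):

<code>
# should report 1 once persistent L2ARC is active
sysctl vfs.zfs.l2arc.rebuild_enabled
</code>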

From support: In an HA environment, ...

** 12.0-U6 **

  * same as 5.1, went fine
  * new standby reboot 5 mins


** 12.0-U6.1 **

  * same as 6, went fine
  * little flakiness on failover, apply pending appeared twice
  * let it go 10 mins, use ping hostname to test
  * new standby reboot 5 mins

** 12.0-U7 **

  * major OpenZFS update
  * same as update 12.0
  * no problems
  * cpu was unusually busy before the upgrade
  * terminated some rsyncs

** 12.0-U8 **

  * 02/23/2022
  * no problems

** 12.0-U8.1 **

  * 05/03/2022
  * failover success at 10 mins
  * then no Pending box, just a Continue button
  * watch console messages, at 17 mins HA enabled

==== Update 13 ====

System > Update > Select (new train 13.0-STABLE)

...

freenas-update -v -T TrueNAS-13.0-STABLE update

...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%

beadm list
...

once both have finished, reboot passive, web gui log back in

once passive is back up, reboot active

web gui log back into the new active, wait for HA to be enabled

debug plus screenshots for snapshot visibility, which is visible (working in 13.0-U3.1), but the database setting is still invisible

took less than 35 mins

\\