cluster:194 [2021/03/09 10:12]
hmeij07 [Update 12]
cluster:194 [2024/01/03 13:50] (current)
hmeij07 [certs]
Line 151: Line 151:
  
 </code>
 +
 +==== certs ====
 +
 +  * Go to System > CA and certs
 +  * Add a cert
 +  * name is "hpcstore-year-CSR", type is CSR
 +  * FQDN is hpcstore.wesleyan.edu, alternate names hpcstore1.wesleyan.edu and hpcstore2.wesleyan.edu
 +  * fill in just the basics, no constraints or advanced settings
 +  * Add, then view the CSR section, copy the CSR and provide it to the InCommon admin
 +
 +  * Once you get the email back, under "Available formats:"
 +  * choose "as Certificate only, PEM encoded" ... download, open in notepad, copy to clipboard
 +  * Go to System > CA and certs
 +  * Select type "import a cert"
 +  * check "CSR exists on this system"
 +  * certificate authority (pick the CSR from the dropdown list)
 +  * paste the public key into the certificate field
 +  * paste the private key from the CSR
 +    * or use the "CSR exists on this system" checkbox (this option)
 +  * System > General, switch certs, Save (this will restart web services)
 +  * check in a new browser (see the openssl sanity check below)
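 +
 +A quick sanity check from any shell (a minimal sketch; the file names are hypothetical, only ''openssl'' is assumed): the modulus of the private key and of the signed certificate must match, and after the switch the web UI should serve the new certificate.
 +
 +<code>
 +# the two md5 sums must be identical
 +openssl rsa  -noout -modulus -in hpcstore-year.key | openssl md5
 +openssl x509 -noout -modulus -in hpcstore-year.crt | openssl md5
 +
 +# after System > General switch + web service restart, confirm what is being served
 +openssl s_client -connect hpcstore.wesleyan.edu:443 -servername hpcstore.wesleyan.edu </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer -dates
 +</code>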
  
 ==== ZFS ====
Line 164: Line 185:
 zfs userspace  tank/zfshomes
 zfs groupspace tank/zfshomes
 +
 +# utterly bizarre: in v12 these commands change
 +
 +root@hpcstore2[~]# su - hmeij07
 +hpcstore2%
 +hpcstore2% zfs get userused@hmeij07 tank/zfshomes
 +NAME           PROPERTY          VALUE             SOURCE
 +tank/zfshomes  userused@hmeij07  718K              local
 +hpcstore2% zfs get userquota@hmeij07 tank/zfshomes
 +NAME           PROPERTY           VALUE              SOURCE
 +tank/zfshomes  userquota@hmeij07  500G               local
 +hpcstore2% zfs get userspace@hmeij07 tank/zfshomes
 +bad property list: invalid property 'userspace@hmeij07'
 +
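 +# a minimal sketch (same dataset as above; "someuser" is a placeholder):
 +# as root, "zfs userspace" should still list every account in one go,
 +# while per-user values come from the userused@/userquota@ properties
 +# zfs userspace -o name,used,quota tank/zfshomes
 +# zfs get -H -o property,value userused@someuser,userquota@someuser tank/zfshomes
 +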
  
 # hpc100
Line 324: Line 359:
  
 ==== Snapshots ====
 +
 +Snapshots are easier to use in new releases ... traverse to the hidden directory ''/zfshomes/.zfs/snapshot'' and find the snapshot day desired. Content is read only. Once you ''cd'' into a snapshot an autofs mount is performed.
 +
 +<code>
 +
 +129.133.52.245:/mnt/tank/zfshomes/.zfs/snapshot/auto-20221126.0200-1y  
 +251T   77T  175T  31%   /zfshomes/.zfs/snapshot/auto-20221126.0200-1y
 +
 +</code>
 +
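 +A minimal sketch of the workflow (the snapshot name is just the example from above; any ''auto-YYYYMMDD.0200-1y'' entry works the same way, and the file paths are hypothetical):
 +
 +<code>
 +# list available daily snapshots (hidden directory, a plain ls of /zfshomes will not show it)
 +ls /zfshomes/.zfs/snapshot | tail -3
 +
 +# cd into one; autofs mounts it (read only) on first access
 +cd /zfshomes/.zfs/snapshot/auto-20221126.0200-1y
 +df -h .
 +
 +# copy a lost file back out of the read-only snapshot
 +cp -p ./hmeij07/somefile /zfshomes/hmeij07/somefile.restored
 +</code>
 +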
  
   * Daily snapshots, one per day, kept for a year (for now)
Line 332: Line 377:
     * check permissions on cloned volume, not Windows!
     * NOTE: once had mnt/tank/zfshomes also reset to Windows, nasty permission denied errors
 +    * when cloning grant access to <del>192.168.0.0/16 and 10.10.0.0/16</del> greentail52 129.133.52.226
     * NFS mount, read only
     * maproot ''root:wheel'' (also for mnt/tank/zfshomes)
 +  * Clone mounted on say ''cottontail2:/mnt/hpc_store_snapshot''
   * Restore actions by user (see the client-side sketch below)
   * Delete clone when done
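 +
 +A minimal sketch of the client side (assumes the clone has been NFS exported as described above; the export path and the restored directory are hypothetical):
 +
 +<code>
 +# on cottontail2 (or greentail52), mount the exported clone read only
 +mkdir -p /mnt/hpc_store_snapshot
 +mount -t nfs -o ro 129.133.52.245:/mnt/tank/clone-YYYYMMDD /mnt/hpc_store_snapshot
 +
 +# user restores what is needed into their live home directory
 +rsync -av /mnt/hpc_store_snapshot/hmeij07/lost_dir/ /zfshomes/hmeij07/lost_dir/
 +
 +# clean up when done
 +umount /mnt/hpc_store_snapshot
 +</code>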
Line 388: Line 433:
   * At 100% standby reboots, HA disables, file system ok
   * Check version on standby, Initiate Fail Over (interrupts file system)
 +  * Login, you end up on the new active server, then
 +  * Logout/login to check the left-top dashboard server
 +  * when the old active server boots, confirm "Pending Update" to complete
 +  * check version on new standby (see the sketch below)
 + 
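 +A small sketch for watching a failover from a client node (assumes the VIP answers to the FQDN from the certs section; ''/etc/version'' holding the release string is also an assumption):
 +
 +<code>
 +# watch the VIP drop and return during "Initiate Fail Over"
 +ping hpcstore.wesleyan.edu
 +
 +# once it answers again, confirm which release the new active node runs
 +ssh root@hpcstore.wesleyan.edu cat /etc/version
 +</code>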
 + 
 ==== HDD ====
  
Line 448: Line 497:
 /dev/da10 HGST:7200:HUS728T8TAL4201:VAKM187L C:30 dR:2 dW:2503 dL:55 uR:0 uW:0 SMART Status:OK **!!!**
 /dev/da9 HGST:7200:HUS728T8TAL4201:VAKL26ML C:30 dR:3 dW:0 dL:0 uR:0 uW:39 SMART Status:OK **!!!**
 +# these drives have not failed yet but have write errors, offline/replace, see below
  
 # next look at output of zpool status -x in fndebug/ZFS/dump.txt
Line 494: Line 543:
  
 </code>
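 +
 +If the appliance is still reachable, the same information can be pulled live instead of from the debug bundle (a minimal sketch; only ''smartctl'' and ''zpool'' are assumed):
 +
 +<code>
 +# SMART health summary for a suspect disk
 +smartctl -H /dev/da10
 +# full attributes and error counter log
 +smartctl -a /dev/da10
 +
 +# pools that are not healthy, with per-device read/write/checksum error counts
 +zpool status -xv
 +</code>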
 +
 +** Pool Unhealthy but not Degraded status**
 +
 +No failed disks, no spare deployed, but the pool is unhealthy. The ''dump.txt'' files for SMART and ZFS show nothing remarkable, but in the console log we observe that disk //da11// has problems. RMA issued. 3rd replacement disk in a year.
 +
 +<code>
 +
 +Mar 21 04:03:57 hpcstore2 (da11:mpr0:0:21:0): READ(10). CDB: 28 00 1b b0 80 13 00 00 02 00 
 +Mar 21 04:03:57 hpcstore2 (da11:mpr0:0:21:0): CAM status: SCSI Status Error
 +Mar 21 04:03:57 hpcstore2 (da11:mpr0:0:21:0): SCSI status: Check Condition
 +Mar 21 04:03:57 hpcstore2 (da11:mpr0:0:21:0): SCSI sense: ABORTED COMMAND asc:44,0 (Internal target failure)
 +Mar 21 04:03:57 hpcstore2 (da11:mpr0:0:21:0): Descriptor 0x80: f7 72
 +Mar 21 04:03:57 hpcstore2 (da11:mpr0:0:21:0): Error 5, Unretryable error
 +
 +1) Storage > Pools. Click gear icon next to the pool and press the "Status" option.
 +2) Find da11 and press the three-dot options button next to it, then press "Offline".
 +3) System > View Enclosure, find&select da11, press "Identify".
 +4) Physically swap the drive on the rack with its replacement.
 +5) Storage > Pool > Status page, bring up three-dot options for the removed drive, 
 +5a) Select member disk from drop down, and press "Replace". Success popup, click Close.
 +6) Wait till resilver finishes.
 +
 +</code>
 +
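 +From the shell the same situation can be followed along (a sketch; the pool name is the one used elsewhere on this page, the log location is the FreeBSD default):
 +
 +<code>
 +# the CAM/SCSI errors for the suspect disk in the console log
 +grep da11 /var/log/messages | tail
 +
 +# watch the resilver after the swap, repeat until it completes with 0 errors
 +zpool status -v tank
 +</code>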
 +
 +==== Console hangs ====
 +
 +12.7
 +
 +As for the issue of the "Please Wait" box spinning forever (when creating a new user via the GUI), I would try refreshing the WebUI service and seeing if that fixes the issue. You can do this by running the following commands via an SSH session to the VIP.
 +
 +service middlewared stop\\
 +service middlewared start
 +
 +"try" is not very convincing ... I open another tab, check the user, close the previous tab (see the shell check below) ... also the ''.[a-z]*'' csh hidden files are not created - no bother, we use ''bash''
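 +
 +A quick shell check that the account really was created despite the spinner (a sketch; the username is a placeholder, ''pw'' and ''getent'' are stock FreeBSD tools):
 +
 +<code>
 +# did the account make it into the user database?
 +getent passwd newuser
 +pw usershow -n newuser
 +
 +# home directory present? (csh dot files may be missing, which is fine for bash users)
 +ls -la /zfshomes/newuser
 +</code>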
 +
 +
  
 ==== Update 12 ====
Line 536: Line 622:
 Result: personality switch active vs standby, took 35 mins
  
 +In two months: ZFS feature update patch, not interruptive, <del>do around 04/09/2021</del>\\
 +Upgrade done 
 + --- //[[hmeij@wesleyan.edu|Henk]] 2021/06/07 07:40//
  
 Storage > Pool > "wheel" > Upgrade Pool
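 +
 +The GUI "Upgrade Pool" action enables the new ZFS feature flags on the pool; the pending state can be checked read-only from the shell first (a sketch; the pool name ''tank'' is the one used elsewhere on this page):
 +
 +<code>
 +# lists pools whose on-disk format does not yet have all supported features enabled
 +zpool upgrade
 +
 +# per-feature detail for the pool
 +zpool get all tank | grep feature@
 +</code>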
  
 +** 12.0-U4.1 **
  
 +  * ditto above, see major release upgrade below
 +  * but old active did not come up, reset controller
 +  * click on "Pending Update", then it came up, and HA enabled
 +  * hmm something about failed to connect failoverscratchdisk?
 +
 +** 12.0-U5.1**
 +
 +  * standby reboot 5 mins
 +  * fail over 1 min
 +  * new standby "apply pending updates" 10 mins
 +  * this version went fine
 +
 +__Not created/set, see below__ ...\\
 +While the underlying issues have been fixed, this setting continues to be disabled by default for additional performance investigation. To manually reactivate persistent L2ARC, log in to the TrueNAS Web Interface, go to System > Tunables, and add a new tunable with these values:
 +<code>
 +Type = sysctl
 +Variable = vfs.zfs.l2arc.rebuild_enabled
 +Value = 1
 +</code>
 +
 +From support: In an HA environment, this tunable actually delays failover while it ensures L2ARC. The tunable for the "persistent L2ARC" Preloads your ARC with what you had before, but slows down imports and failovers. Not super useful if you don't reboot or failover often.
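 +
 +To see whether the tunable is active without adding it (a sketch; the sysctl name is taken from the release note above):
 +
 +<code>
 +sysctl vfs.zfs.l2arc.rebuild_enabled
 +# 0 = persistent L2ARC rebuild disabled (the default described above), 1 = enabled
 +</code>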
 +
 +** 12.0-U6 **
 +
 +  * same as 5.1, went fine, 
 +  * new standby reboot 5 mins
 +
 +
 +** 12.0-U6.1 **
 +
 +  * same as 6, went fine, 
 +  * little flakiness on failover, apply pending appeared twice
 +  * let it go 10 mins, use ping hostname to test
 +  * new standby reboot 5 mins
 +
 +** 12.0-U7 **
 +
 +  * major OpenZFS update
 +  * same as update 12.0
 +  * no problems
 +  * cpu was unusually busy before upgrade
 +    * terminated some rsyncs
 +
 +** 12.0-U8 **
 +
 +  * 02/23/2022
 +  * no problems
 +
 +** 12.0-U8.1 **
 +
 +  * 05/03/2022
 +  * failover success at 10 mins
 +  * then no Pending box, just a Continue button
 +  * watch console messages, at 17 mins HA enabled
 +
 +==== Update 13 ====
 +
 +System > Update > Select (new train 13.0-STABLE)
 +
 +<code>
 +
 +# in shell
 +
 + freenas-update -v -T TrueNAS-13.0-STABLE check
 +
 + freenas-update -v -T TrueNAS-13.0-STABLE update
 +
 +…10%…20%…30%…40%…50%…60%…70%…80%…90%…100% 
 +
 +beadm list
 +# (Active N = 12.0-U8.1 and R = 13.0-U3.1)
 +
 +</code>
 +
 +once both have finished, reboot passive, web gui log back in
 +
 +once passive back up, reboot active
 +
 +web gui log back into new active, wait for HA to be enabled
 +
 +debug plus screenshots for snapshot visibility, which is visible (working in 13.0-U3.1), but the database setting is still invisible
 +
 +took less than 35 mins
 +
 +<code>
 +
 +# bstop/bresume: suspend, later resume, all of the scheduler user's jobs (job id 0 = all jobs)
 +bstop 0
 +bresume 0
 +# scontrol (Slurm): manual, one at a time ("joblist" = the job id(s))
 +scontrol hold joblist
 +# build the job id list, one at a time
 +# for i in `squeue | grep '     ' | awk '{print $1}'`; do echo $i; done
 +# then grep '     '
 +scontrol suspend joblist
 +scontrol resume  joblist
 +scontrol release joblist
 +
 +</code>
 +
 +
 +** 13.0-U4  ** 04/03/2023
 +
 +  * apply pending update
 +  * 10 mins, standby on new update
 +  * initiate fail over, 1 min
 +  * look for the icon in top bar, moving back and forth
 +  * finish upgrade
 +  * wait for HA to be enabled
 +  * check versions
 +
 +
 +** 13.0-U5.1  ** 07/07/2023
 +
 +  * apply pending update
 +  * 10 mins, standby on new update
 +  * initiate fail over, 1 min
 +  * look for the icon in top bar, moving back and forth
 +  * finish upgrade
 +  * wait for HA to be enabled
 +  * check versions
 +
 +
 +** 13.0-U5.3 ** 08/25/2023
 +
 +  * no problems
 +
 +**Next support ticket: Ask if you ever need to reboot the disk shelves?** Full power off?
 +
 +Hi, I'm archiving content from my TrueNAS appliance to another platform, then deleting the files migrated. I'm observing directories like this: 7.5 million files in 990 GB, or 15 million files in 7 TB. Should I be concerned that the disk shelves have never been cold rebooted? Like XFS replaying the log journal for a clean mount? My HA nodes reboot on upgrade but I realize the disk arrays keep running, always. Please advise, thanks.
 +
 +Tier 1 Support: The 2 ES24 shelves do not need to be rebooted as they just house the drives themselves and provide power to them. There shouldn't be any concern with these being on at all times.
 +
 +Rsync stats (after decompressing)
 +<code>
 +sod1/
 +Number of files: 18,691,764
 +Total transferred file size: 13072322138140 bytes
 +arnt_rosetta/
 +Number of files: 8,825,674 
 +Total transferred file size: 1,798,675,349,128 bytes
 +</code>
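 +
 +The kind of invocation that produces the statistics above (a sketch; source and target paths are hypothetical):
 +
 +<code>
 +rsync -av --stats /zfshomes/sod1/ /archive/sod1/ > /tmp/sod1-rsync.log 2>&1
 +egrep 'Number of files:|Total transferred file size:' /tmp/sod1-rsync.log
 +</code>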
  
  
  
 +** 13.0-U6.1 ** 12/12/2023
  
 +  * no problems
  
 \\