
TrueNAS/ZFS m40ha

Notes on the deployment and production changes for our 500T iXsystems M40HA storage appliance.

Fixed the date on controllers by pointing ntpd to 129.133.1.1
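
The CLI equivalent is roughly the following (a sketch only; on TrueNAS the NTP server list is normally managed from the web UI, and service names may differ between CORE and SCALE):

# point ntpd at the campus NTP server (sketch of an /etc/ntp.conf entry)
server 129.133.1.1 iburst

# restart ntpd and confirm the peer is reachable and syncing
service ntpd restart
ntpq -p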

ES60 shelf: middle amber light blinking, which is OK; green health check on the right

  • SAS ports A1 and B1 show green LEDs

Verified read and write caches (but I do not remember where…)

As opposed to the X series, the 1G TrueNAS management port is not used

The cyclic message was caused by duplicate VHIDs

summary

To recap, we were able to verify that the issue with the X20 web UI not loading after the M40 was plugged in was due to duplicate VHIDs; once that was changed, the issue resolved. We also verified that the read and write cache drives showed as healthy, and that the pool and HA were enabled and healthy. From there we set up our other network interface and then created our NFS share. Next we configured the NFS share to be locked down to certain hosts and networks, and then set up a replication task from your X series to your new M series for the NFS shares.

populating /zfshomes

Once the replication task is done, all the home dirs should show up on the M40HA.

  • confirm /zfshomes populates on target when replication finished = check (CLI checks sketched after this list)
    • pgrep -laf sending
  • confirm old snapshots get deleted on both sides = check
    • yes on source, no on target; set retention policy “same as source” on destination of task
    • manual cleaning on target for now (update; not needed per above) = check
  • confirm new snapshots get transferred to target = check
  • confirm new content appears in target homedirs = check
  • confirm deletions happen on target homedirs = check
  • start account creation when replication task is not running = in progress
    • hpc100-hpc200 first
    • then active accounts only (last year from Q data)
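
The CLI checks referenced above, run in a shell on either controller (a sketch; the dataset name matches the replication task below):

# newest snapshots on the dataset; source and target should converge
zfs list -t snapshot -o name,creation -s creation tank/zfshomes | tail -5

# is a replication send currently running?
pgrep -laf sending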

Upon nearing cutover day

  • Idle the HPC (no jobs, no users)
  • Disable daily snapshots
  • Create manual snapshot (name auto-), see CLI sketch after this list
  • Confirm replication then disable replication
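
The manual snapshot step, if done from the CLI instead of the web UI, would look roughly like this (the snapshot name here is a placeholder, not the task's auto- naming schema):

# recursive manual snapshot before cutover (placeholder name)
zfs snapshot -r tank/zfshomes@manual-cutover
zfs list -t snapshot tank/zfshomes | tail -3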

Once the /zfshomes mount is in production

  • M40HA daily snapshots
  • Replicate via a PUSH task to X20HA

replication

First-time replication tasks can take a long time to complete as the entire snapshot must be copied to the destination system. Replicated data is not visible on the receiving system until the replication task completes. Later replications only send the snapshot changes to the destination system. Interrupting a running replication requires the replication task to restart from the beginning.

The target dataset on the receiving system is automatically created in read-only mode to protect the data. To mount or browse the data on the receiving system, create a clone of the snapshot and use the clone. We set the destination Read Only policy to IGNORE, so the dataset should be read/write on the M40. Enable the SSH service on the target.
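
For reference, browsing a read-only replica without changing the Read Only policy would mean cloning a snapshot, roughly as follows (snapshot and clone names are placeholders):

# clone a replicated snapshot into a writable dataset for browsing
zfs clone tank/zfshomes@some-auto-snapshot tank/zfshomes-browse
# clean up when done
zfs destroy tank/zfshomes-browse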

On source System > SSH Keypairs

  • name replication
  • generate key pair
  • save

On source System > SSH Connections > Add

  • name replication
  • host IP or FQDN of target (select Manual)
  • username root
  • discover remote ssh key
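
A quick sanity check of the new connection from a shell on the source controller (a sketch; the target host is a placeholder, and the key path depends on where the generated keypair is stored):

# should return the target hostname without a password prompt
ssh -o BatchMode=yes -i /path/to/replication_key root@<target-IP-or-FQDN> hostname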

On source Tasks > Replication Tasks

  • name zfshomesrepel
  • direction PUSH
  • transport SSH
  • check enabled
  • ssh connection replication (from above)
  • on source side
    • tank/zfshomes
    • recursive not checked
    • include dataset properties
    • periodic snapshot task tank/zfshomes-auto-%Y%m%d.%H%M-1y - 6 MONTH(S) - Enabled
    • check run automatically
  • on target side
    • ssh connection replication
    • tank/zfshomes
    • IGNORE
    • snapshot retention None

Save

On the source, the Replication Tasks page shows the task as enabled and PENDING

You can kick this off with Run Now in the Edit menu of the task.
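
Conceptually, what the PUSH task does on each run is an (incremental) zfs send piped over the SSH connection, roughly like this (a sketch only; TrueNAS drives this through its middleware, and the snapshot names and target host are placeholders):

# first run: full send of the oldest snapshot
zfs send tank/zfshomes@snap1 | ssh root@<m40ha> zfs recv -F tank/zfshomes

# later runs: incremental send between the last common snapshot and the newest one
zfs send -i tank/zfshomes@snap1 tank/zfshomes@snap2 | ssh root@<m40ha> zfs recv tank/zfshomes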

replication and NFS

When the M40HA zfshomes is mounted on the scratch server for testing via 10.10.0.0/16, the mount becomes unavailable when replication kicks off on the X20HA with target M40HA. That's a problem. So on cutover day, stop the PUSH replication and Periodic Snapshots on the X20HA. After all mounts have been switched to the M40HA “zfshomes”, configure Periodic Snapshots on the M40HA and a PUSH replication back to the X20HA.
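
For reference, the test mount on greentail52 corresponds to an /etc/fstab entry along these lines (the mount options shown are an assumption, not copied from the client):

# /etc/fstab on the NFS client (sketch; options are assumptions)
10.10.102.240:/mnt/tank/zfshomes  /mnt/m40/zfshomes  nfs  defaults  0 0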

[root@greentail52 ~]# df -h /mnt/m40/zfshomes
df: ‘/mnt/m40/zfshomes’: Input/output error
[root@greentail52 ~]# umount /mnt/m40/zfshomes
[root@greentail52 ~]# mount /mnt/m40/zfshomes
mount.nfs: access denied by server while mounting 10.10.102.240:/mnt/tank/zfshomes
[root@greentail52 ~]# echo restarting nfs service on m40ha
restarting nfs service on m40ha
[root@greentail52 ~]# sleep 5m
[root@greentail52 ~]# mount /mnt/m40/zfshomes
[root@greentail52 ~]# df -h /mnt/m40/zfshomes
Filesystem                        Size  Used Avail Use% Mounted on
10.10.102.240:/mnt/tank/zfshomes  435T  131T  305T  31% /mnt/m40/zfshomes
[root@greentail52 ~]# date
Tue Oct 15 11:56:09 EDT 2024
[root@greentail52 ~]# df -h /mnt/m40/zfshomes/
Filesystem                        Size  Used Avail Use% Mounted on
10.10.102.240:/mnt/tank/zfshomes  435T  131T  305T  31% /mnt/m40/zfshomes
[root@greentail52 ~]# echo mount ok overnight, re-enabling replication on x20
mount ok overnight, re-enabling replication on x20
[root@greentail52 ~]# echo replication task, run now
replication task, run now
[root@greentail52 ~]# df -h /mnt/m40/zfshomes/
df: ‘/mnt/m40/zfshomes/’: Stale file handle
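
When the mount goes stale or is denied like above, a quick check that the share is actually exported again before remounting (a sketch):

# list NFS exports advertised by the m40ha
showmount -e 10.10.102.240

# then force a clean remount
umount -f /mnt/m40/zfshomes
mount /mnt/m40/zfshomes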

failover and replication

Testing failover and assessing that replication continues (X20HA PUSH to M40HA; make sure both controllers have the authorized_keys for hpcstore1 - add hpcstore2)

  • Initiated failover from M40HA controller 2; an error window message pops up
  • “Node can not be reached. Node CARPS states do not agree”

Yet my web browser shows hpcm40eth0c2 and a minute later hpcm40eth0c1 shows up and HA is enabled.

Replication of snapshots continues OK after failover, which was the point of the testing.

  • Initiated failover again and now back to controller 1
  • Controller 2 shows up a minute later (reboots)
  • No error window this time
  • Time is wrong on controller 2 …
  • Load the IPMI interface, go to Configuration > Date and Time, enable NTP, refresh (an ipmitool alternative is sketched after this list)
    • that fixes the time
    • the NTP button then reverts back to “disabled”
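
The ipmitool alternative mentioned above, for checking the BMC clock from a shell (a sketch; the IPMI address and credentials are placeholders):

# read the BMC's SEL clock and compare with the OS clock
ipmitool -I lanplus -H <controller2-ipmi-ip> -U <user> -P <password> sel time get
date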

Check replication again. Do this one more time before production.

12.x docs
Failover is not allowed if both TrueNAS controllers have the same CARP state. A critical Alert (page 303) is generated and the HA icon shows HA Unavailable.

CAs & Certs

  • Generate a CSR, insert year in Name
  • then…
  • download inCommon
    • Issuing CA certificates only:
    • as Root/Intermediate(s) only, PEM encoded (first one in chain section)
  • menu CAs > import CA
  • copy all info into public
  • menu Certs > import a cert
    • signing auth, point to the CSR
    • as Certificate only, PEM encoded (first one in certs format)
    • copy in public
  • don't click HTTP → HTTPS redirect so you don't get locked out
  • when the cert expires on you, just access via https: (expiration check sketched below)
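
The expiration check mentioned above, runnable from any client (a sketch; the hostname is a placeholder):

# print the validity window of the certificate the web UI is serving
echo | openssl s_client -connect <m40ha-host>:443 -servername <m40ha-host> 2>/dev/null | openssl x509 -noout -dates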