
TrueNAS/ZFS m40ha

Notes on the deployment and production changes for our 500T iXsystems M40HA storage appliance.

Fixed the date on controllers by pointing ntpd to 129.133.1.1
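
The CLI equivalent is roughly the following (a sketch only; on TrueNAS the NTP server list is normally managed from the web UI, and service names may differ between CORE and SCALE):

# point ntpd at the campus NTP server (sketch of an /etc/ntp.conf entry)
server 129.133.1.1 iburst

# restart ntpd and confirm the peer is reachable and syncing
service ntpd restart
ntpq -p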

ES60 shelf: middle amber light blinking, which is OK; green health check on the right

  • SAS ports A1 and B1 show green LEDs

Verified read and write caches (but I do not remember where…)

As opposed to the X series, the 1G TrueNAS management port is not used

The cyclic message was caused by duplicate VHIDs

summary

To recap, we were able to verify that the issue with the X20 web UI not loading after the M40 was plugged in was due to duplicate VHIDs; once that was changed, the issue resolved. We also verified that the read and write cache drives showed as healthy, and that the pool and HA were enabled and healthy. From there we set up our other network interface and then created our NFS share. Next we configured the NFS share to be locked down to certain hosts and networks, and then set up a replication task from your X series to your new M series for the NFS shares.

populating /zfshomes

Once the replication task is done, all the home dirs should show up on the M40HA.

  • confirm /zfshomes populates on target when replication finished = check (CLI checks sketched after this list)
    • pgrep -laf sending
  • confirm old snapshots get deleted on both sides = check
    • yes on source, no on target; set retention policy “same as source” on destination of task
    • manual cleaning on target for now (update; not needed per above) = check
  • confirm new snapshots get transferred to target = check
  • confirm new content appears in target homedirs = check
  • confirm deletions happen on target homedirs = check
  • start account creation when replication task is not running = in progress
    • hpc100-hpc200 first
    • then active accounts only (last year from Q data)
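
The CLI checks referenced above, run in a shell on either controller (a sketch; the dataset name matches the replication task below):

# newest snapshots on the dataset; source and target should converge
zfs list -t snapshot -o name,creation -s creation tank/zfshomes | tail -5

# is a replication send currently running?
pgrep -laf sending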

Upon nearing cutover day

  • Idle the HPC (no jobs, no users)
  • Disable daily snapshots
  • Create manual snapshot (name auto-), see CLI sketch after this list
  • Confirm replication then disable replication
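
The manual snapshot step, if done from the CLI instead of the web UI, would look roughly like this (the snapshot name here is a placeholder, not the task's auto- naming schema):

# recursive manual snapshot before cutover (placeholder name)
zfs snapshot -r tank/zfshomes@manual-cutover
zfs list -t snapshot tank/zfshomes | tail -3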

Once the /zfshomes mount is in production

  • M40HA daily snapshots
  • Replicate via a PUSH task to X20HA

replication

First-time replication tasks can take a long time to complete as the entire snapshot must be copied to the destination system. Replicated data is not visible on the receiving system until the replication task completes. Later replications only send the snapshot changes to the destination system. Interrupting a running replication requires the replication task to restart from the beginning.

The target dataset on the receiving system is automatically created in read-only mode to protect the data. To mount or browse the data on the receiving system, create a clone of the snapshot and use the clone. We set the destination Read Only policy to IGNORE, so the dataset should be read/write on the M40. Enable the SSH service on the target.
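
For reference, browsing a read-only replica without changing the Read Only policy would mean cloning a snapshot, roughly as follows (snapshot and clone names are placeholders):

# clone a replicated snapshot into a writable dataset for browsing
zfs clone tank/zfshomes@some-auto-snapshot tank/zfshomes-browse
# clean up when done
zfs destroy tank/zfshomes-browse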

On source System > SSH Keypairs

  • name replication
  • generate key pair
  • save

On source System > SSH Connections > Add

  • name replication
  • host IP or FQDN of target (select Manual)
  • username root
  • discover remote ssh key
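
A quick sanity check of the new connection from a shell on the source controller (a sketch; the target host is a placeholder, and the key path depends on where the generated keypair is stored):

# should return the target hostname without a password prompt
ssh -o BatchMode=yes -i /path/to/replication_key root@<target-IP-or-FQDN> hostname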

On source Tasks > Replication Tasks

  • name zfshomesrepel
  • direction PUSH
  • transport SSH
  • check enabled
  • ssh connection replication (from above)
  • on source side
    • tank/zfshomes
    • recursive not checked
    • include dataset properties
    • periodic snapshot task tank/zfshomes-auto-%Y%m%d.%H%M-1y - 6 MONTH(S) - Enabled
    • check run automatically
  • on target side
    • ssh connection replication
    • tank/zfshomes
    • IGNORE
    • snapshot retention None

Save

On the source, the Replication Tasks page shows the task as enabled and PENDING

You can kick this off with Run Now in the Edit menu of the task.
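
Conceptually, what the PUSH task does on each run is an (incremental) zfs send piped over the SSH connection, roughly like this (a sketch only; TrueNAS drives this through its middleware, and the snapshot names and target host are placeholders):

# first run: full send of the oldest snapshot
zfs send tank/zfshomes@snap1 | ssh root@<m40ha> zfs recv -F tank/zfshomes

# later runs: incremental send between the last common snapshot and the newest one
zfs send -i tank/zfshomes@snap1 tank/zfshomes@snap2 | ssh root@<m40ha> zfs recv tank/zfshomes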

replication and NFS

When the M40HA zfshomes is mounted on the scratch server for testing via 10.10.0.0/16, the mount becomes unavailable when replication kicks off on the X20HA with target M40HA. That's a problem. So on cutover day, stop the PUSH replication and Periodic Snapshots on the X20HA. After all mounts have been switched to the M40HA “zfshomes”, configure Periodic Snapshots on the M40HA and a PUSH replication back to the X20HA.
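
For reference, the test mount on greentail52 corresponds to an /etc/fstab entry along these lines (the mount options shown are an assumption, not copied from the client):

# /etc/fstab on the NFS client (sketch; options are assumptions)
10.10.102.240:/mnt/tank/zfshomes  /mnt/m40/zfshomes  nfs  defaults  0 0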

[root@greentail52 ~]# df -h /mnt/m40/zfshomes
df: ‘/mnt/m40/zfshomes’: Input/output error
[root@greentail52 ~]# umount /mnt/m40/zfshomes
[root@greentail52 ~]# mount /mnt/m40/zfshomes
mount.nfs: access denied by server while mounting 10.10.102.240:/mnt/tank/zfshomes
[root@greentail52 ~]# echo restarting nfs service on m40ha
restarting nfs service on m40ha
[root@greentail52 ~]# sleep 5m
[root@greentail52 ~]# mount /mnt/m40/zfshomes
[root@greentail52 ~]# df -h /mnt/m40/zfshomes
Filesystem                        Size  Used Avail Use% Mounted on
10.10.102.240:/mnt/tank/zfshomes  435T  131T  305T  31% /mnt/m40/zfshomes
[root@greentail52 ~]# date
Tue Oct 15 11:56:09 EDT 2024
[root@greentail52 ~]# df -h /mnt/m40/zfshomes/
Filesystem                        Size  Used Avail Use% Mounted on
10.10.102.240:/mnt/tank/zfshomes  435T  131T  305T  31% /mnt/m40/zfshomes
[root@greentail52 ~]# echo mount ok overnight, re-enabling replication on x20
mount ok overnight, re-enabling replication on x20
[root@greentail52 ~]# echo replication task, run now
replication task, run now
[root@greentail52 ~]# df -h /mnt/m40/zfshomes/
df: ‘/mnt/m40/zfshomes/’: Stale file handle
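
When the mount goes stale or is denied like above, a quick check that the share is actually exported again before remounting (a sketch):

# list NFS exports advertised by the m40ha
showmount -e 10.10.102.240

# then force a clean remount
umount -f /mnt/m40/zfshomes
mount /mnt/m40/zfshomes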

failover and replication

Testing failover and assessing that replication continues (X20HA PUSH to M40HA; make sure both controllers have the authorized_keys for hpcstore1 - add hpcstore2)

  • Initiated failover from M40HA controller 2; an error window message pops up
  • “Node can not be reached. Node CARPS states do not agree”

Yet my web browser shows hpcm40eth0c2 and a minute later hpcm40eth0c1 shows up and HA is enabled.

Replication of snapshots continues OK after failover, which was the point of the testing.

  • Initiated failover again and now back to controller 1
  • Controller 2 shows up a minute later (reboots)
  • No error window this time
  • Time is wrong on controller 2 …
  • Load the IPMI interface, go to Configuration > Date and Time, enable NTP, refresh (an ipmitool alternative is sketched after this list)
    • that fixes the time
    • the NTP button then reverts back to “disabled”
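
The ipmitool alternative mentioned above, for checking the BMC clock from a shell (a sketch; the IPMI address and credentials are placeholders):

# read the BMC's SEL clock and compare with the OS clock
ipmitool -I lanplus -H <controller2-ipmi-ip> -U <user> -P <password> sel time get
date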

Check replication again. Do this one more time before production.

12.x docs
Failover is not allowed if both TrueNAS controllers have the same CARP state. A critical Alert (page 303) is generated and the HA icon shows HA Unavailable.

CAs & Certs

  • Generate a CSR, insert year in Name
  • then…
  • download inCommon
    • Issuing CA certificates only:
    • as Root/Intermediate(s) only, PEM encoded (first one in chain section)
  • menu CAs > import CA
  • copy all info into public
  • menu Certs > import a cert
    • signing auth, point to the CSR
    • as Certificate only, PEM encoded (first one in certs format)
    • copy in public
  • don't click HTTP → HTTPS redirect so you don't get locked out
  • when the cert expires on you, just access via https: (expiration check sketched below)
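
The expiration check mentioned above, runnable from any client (a sketch; the hostname is a placeholder):

# print the validity window of the certificate the web UI is serving
echo | openssl s_client -connect <m40ha-host>:443 -servername <m40ha-host> 2>/dev/null | openssl x509 -noout -dates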