TrueNAS/ZFS
Notes. Mainly for me but might be useful/of interest to users.
Message:
Our current file server is sharptail.wesleyan.edu, which serves out home directories (/home, 10T). A new file server, hpcstore.wesleyan.edu, will be deployed, taking over this function (/zfshomes, 190T). This notice is to inform you that your home directory has been cut over.
There are no changes for you. When you log into cottontail or cottontail2 you end up in your new home directory. $HOME and ~username work as usual. The only difference is that your old home was at /home/username and now it is at /zfshomes/username.
If you wish to load/unload large content from your new home directory, please log into hpcstore.wesleyan.edu directly (via ssh/sftp) or, preferably, use rsync with a bandwidth throttle no larger than "--bwlimit=5000".
Details at
https://dokuwiki.wesleyan.edu/doku.php?id=cluster:194
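For example, a throttled upload from a Linux/Mac workstation might look like this (the source path and username hpc21 are placeholders):
# push a directory into the new home with the bandwidth cap applied
rsync -av --bwlimit=5000 ~/bigdata/ hpc21@hpcstore.wesleyan.edu:/mnt/tank/zfshomes/hpc21/bigdata/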
Summary
- SSH (sftp/scp)
# from outside via VPN
$ ssh hpc21@hpcstore.wesleyan.edu
hpc21@hpcstore.wesleyan.edu's password:
FreeBSD 11.2-STABLE (TrueNAS.amd64)
(banner snip ...)
Welcome to TrueNAS
# note we ended up on node "B"
[hpc21@hpcstore2 ~]$ pwd
/mnt/tank/zfshomes/hpc21
[hpc21@hpcstore2 ~]$ echo $HOME
/mnt/tank/zfshomes/hpc21
# quota check
[hpc21@hpcstore2 ~]$ zfs userspace tank/zfshomes | egrep -i "quota|$USER"
TYPE        NAME   USED  QUOTA
POSIX User  hpc21  282K   500G
# from inside HPCC with ssh keys properly set up
[hpc21@cottontail ~]$ ssh hpcstore
Last login: Mon Mar 23 10:58:27 2020 from 129.133.52.222
[hpc21@cottontail ~]$ echo $HOME
/zfshomes/hpc21
[hpc21@hpcstore2 ~]$ df -h .
Filesystem     Size  Used  Avail  Capacity  Mounted on
tank/zfshomes  177T  414G   177T        0%  /mnt/tank/zfshomes
- RSYNC
[hmeij@ThisPC]$ rsync -vac --dry-run --whole-file --bwlimit=4096 \
  c:\Users\hmeij\ hpcstore:/mnt/tank/zfshomes/hmeij/
sending incremental file list
...
- SMB/CIFS
- all users have shares but not class accounts (hpc101-hpc200)
Not any more. There is a serious conflict between NFS and SMB ACLs if both protocols are enabled on the same dataset, so nobody has a Samba share. If you want drag & drop you need to use something like CyberDuck and make an sftp connection.
— Henk 2020/05/28 11:10
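From the command line, the same drag & drop transfers can be done with sftp (account hpc21 is used as an example):
$ sftp hpc21@hpcstore.wesleyan.edu
sftp> put thesis.tar.gz
sftp> get results/output.dat
sftp> quit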
# windows command line
C:\Users\hmeij07>net use W: \\hpcstore.wesleyan.edu\hmeij /user:hmeij
Enter the password for 'hmeij' to connect to 'hpcstore.wesleyan.edu':
The command completed successfully.
# or ThisPC > Map Network Drive
\\hpcstore.wesleyan.edu\username
# user is hpcc username, password is hpcc password
Consoles
- port 5
- set up mac
- plug in pin2usb cable (look for device /dev/cu.usbserial-*)
- launch terminal, invoke screen
- screen -L /dev/cu.usbserial-* 38400 (device, then baud rate)
- sysadmin/superuser, now you can set basic stuff via ipmi
- ifconfig eth0 | grep 'inet addr' or
- ipmitool -H 127.0.0.1 -U admin -P admin lan print
- ipmitool -H 127.0.0.1 -U admin -P admin lan set 1 ipaddr … (etc + netmask + defgw)
- to set initial IPs/netmasks (see the sketch after this list)
- port 10 (if option 12 of the netcli boot menu does not show)
- unplug console cable, plug in pin2serial cable
- set up Windows laptop, launch HyperTerminal, baud rate 115200
- option 12 in the boot menu launches netcli; change/reset the root password here
- port 80→443, web site (with shell access to netcli)
- gui
- shell
- all non-zfs changes made from the shell are not persistent across boots
- except ssh keys and directory permissions
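A minimal sketch of the port 5 console session described above, assuming a Mac with the pin2usb cable attached; the device name and IP values are placeholders:
# find the serial device, attach at 38400 baud (-L logs the session)
ls /dev/cu.usbserial-*
screen -L /dev/cu.usbserial-A1B2C3 38400
# once logged in as sysadmin/superuser, set the IPMI network settings
ipmitool -H 127.0.0.1 -U admin -P admin lan print 1
ipmitool -H 127.0.0.1 -U admin -P admin lan set 1 ipaddr 192.168.102.10
ipmitool -H 127.0.0.1 -U admin -P admin lan set 1 netmask 255.255.0.0
ipmitool -H 127.0.0.1 -U admin -P admin lan set 1 defgw ipaddr 192.168.1.1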
HA
High Availability. Two controllers hpcstore1 (also known as A) and hpcstore2 (also known as B).
Virtual IP hpcstore.wesleyan.edu floats back and forth seamlessly (tested; some protocols will lose connectivity briefly). In a split-brain situation (no response, both controllers think they are primary), disconnect one controller from power, then reboot. Then reconnect and wait a few minutes for the HA icon to turn green when the controller comes back online.
An update goes like this and is not an interruption. Check for and apply updates. They are applied to the partner and the partner is rebooted. When the partner comes back up it becomes the primary. Now you need to apply updates to the other partner. When it comes back up it remains the secondary node. Check that both nodes run the same version.
SSH
SSH is allowed for large content transfers using scp or sftp, or for just checking things out.
TODO: rsync?
Home directories are located in /mnt/tank/zfshomes. When users get cut over, their location is updated in the /etc/passwd file and $HOME becomes /zfshomes/username, so we can keep track of who has been cut over. This is followed by an rsync process that copies from the TrueNAS/ZFS appliance back to sharptail:/home.
TODO: write script (a sketch follows below).
TODO: add disks to old sharptail:/home, enlarge and merge LVMs.
TODO: backup target
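A minimal sketch of that reverse-sync script, assuming it runs on the appliance with ssh keys set up towards sharptail; the throttle value mirrors the one used for the cutover rsync below:
#!/bin/sh
# mirror each cut-over home directory back to sharptail:/home (sketch)
for u in $(ls /mnt/tank/zfshomes); do
    rsync -ac --bwlimit=4096 --whole-file --stats \
        /mnt/tank/zfshomes/$u/ sharptail:/home/$u/
done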
# create user: no new primary group, but set primary + auxiliary groups and full-name
# set shell, set permissions, some random passwd (date +%N with symbols)
# then move all dot files into ~/._nas, scp ~/.ssh over
# copy content over from sharptail, @hpcstore...
rsync -ac --bwlimit=4096 --whole-file --stats sharptail:/home/hmeij/ /mnt/tank/zfshomes/hmeij/
# SSH keys in place so this should be passwordless, test
ssh username@hpcstore.wesleyan.edu
# go to $HOME
cd /mnt/tank/zfshomes/username
# this will be mounted HPC-wide at /zfshomes/username
ZFS
# for users
zfs allow everyone userquota,userused tank/zfshomes
# as user
zfs userspace tank/zfshomes
zfs groupspace tank/zfshomes
# hpc100
TYPE NAME USED QUOTA
POSIX User hpc100 14.9G 100G
POSIX User root 1K none
# set quota
zfs set userquota@hpc100=100g tank/zfshomes
zfs set groupquota@hpc100=100g tank/zfshomes
# get userused
zfs get userused@hpc100 tank/zfshomes
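# batch examples (a sketch; the account range and quota value are placeholders, sh/bash syntax)
# apply the same quota to a range of class accounts
for i in $(seq 101 200); do zfs set userquota@hpc$i=100g tank/zfshomes; done
# usage/quota report for all users at once
zfs userspace -o name,used,quota tank/zfshomes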
# list snapshots
zfs list -t snapshot
# output
NAME USED AVAIL REFER MOUNTPOINT
freenas-boot/ROOT/default@2019-12-17-22:04:34 2.10M - 827M -
tank/zfshomes@auto-20200309.1348-1y 210K - 558K -
tank/zfshomes@auto-20200310.1348-1y 219K - 14.8G -
tank/zfshomes@auto-20200311.1348-1y 165K - 14.9G -
# health
zpool status -v tank
pool: tank
state: ONLINE
scan: scrub repaired 0 in 0 days 00:00:02 with 0 errors on Sun Feb 2 03:00:04 2020
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
gptid/104a748f-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/10d0c16e-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/115414b8-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/11dd105d-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/12636cff-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/12e6d913-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/13676269-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/13ee7fb2-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/14706a76-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/1504c334-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/1592a623-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
raidz2-1 ONLINE 0 0 0
gptid/16250571-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/16b4a392-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/173e4974-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/17cb4efb-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/1861c750-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/18ef1edd-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/197d9fc9-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/1a09eebb-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/1a99e25d-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/1b2dd0b5-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/1bbaa252-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
raidz2-2 ONLINE 0 0 0
gptid/1c60422c-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/1cedf16e-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/1d807f27-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/1e0d0a20-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/1e9dec87-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/1f603e96-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/1ff8b82e-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/2087c210-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/21128be3-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/21ab0c6c-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
gptid/2241e3e2-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
logs
gptid/238a4161-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
cache
gptid/23426c62-211a-11ea-bbd5-b496915e40c8 ONLINE 0 0 0
spares
gptid/22f36c47-211a-11ea-bbd5-b496915e40c8 AVAIL
errors: No known data errors
SMB
SMB/CIFS (Samba) shares are also created once the homedir is up. NOT!
- do not mix SMB and NFS on same dataset, not supported
- 'permission denied' problems in NFS on the top level of the share (deeper is ok)
- windows ACLs on top of unix file system = bad
# the trailing + below (extended ACL marker) is the problem
drwxr-xr-x+ 147 root wheel 147 Apr 27 08:17 /mnt/tank/zfshomes
# either use ACL editor to strip off in 11.3-U2 or
setfacl -bn /mnt/tank/zfshomes/
followed by for example
find /mnt/tank/zfshomes/hmeij/ -type d -exec setfacl -bn {} \;
# also unsupported via shell
- For each user
- /mnt/tank/zfshomes/username
- uncheck default permissions
- valid users = username, @group(s)
/usr/local/etc/smb4.conf
Note: at user creation a random password is set. Please ask to have it reset in order to access SMB shares. (There should be some self-serve password reset functionality with email confirmation, but I cannot find it for now.) Any passwords changed outside of the database will not be persistent across boots.
# windows, map network drive
\\hpcstore.wesleyan.edu\username
# credentials, one or all of these may work
WORKGROUP\username
localhost\username
username
Change $HOME location in /etc/passwd and propagate.
Note remove access to old $HOME … chown root:root + chmod o-rwx
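For example, on sharptail for a cut-over user (the username is illustrative):
# lock down the old home directory after the cutover
chown root:root /home/hmeij
chmod o-rwx /home/hmeij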
END OF USER ACCOUNT SETUP
NFS
- maproot is needed
- export to both private networks
root@hpcstore1[~]# cat /etc/exports
/mnt/tank/zfshomes -maproot="root":"wheel" -network 192.168.0.0/16
/mnt/tank/zfshomes -maproot="root":"wheel" -network 10.10.0.0/16
/mnt/tank/zfshomes-auto-20200310.1348-1y-clone -ro \
  -maproot="root":"wheel" -network 192.168.0.0/16
/mnt/tank/zfshomes-auto-20200310.1348-1y-clone -ro \
  -maproot="root":"wheel" -network 10.10.0.0/16
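On a client node the export can be checked and test-mounted by hand before adding it to fstab (the 10.10.102.245 address is taken from the fstab examples further down; adjust as needed):
# verify the exports are visible, then mount the home dataset
showmount -e 10.10.102.245
mount -t nfs -o vers=3 10.10.102.245:/mnt/tank/zfshomes /zfshomes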
Rollback
Rollback is a potentially dangerous operation.
Instead, restore via snapshots. See the Guide.
Snapshots
- Daily snapshots, one per day, kept for a year (for now)
- Recursive
- Read only
- snapshot period 1:00-23:00
- Request a clone to be made from a snapshot (specify date)
- check permissions on cloned volume, not windows!
- NOTE: once had /mnt/tank/zfshomes also reset to windows permissions, nasty permission denied errors
- when cloning grant access to 192.168.0.0/16 and 10.10.0.0/16
- NFS mount, read only
- maproot root:wheel (also for /mnt/tank/zfshomes)
- Clone mounted on, say, cottontail2:/mnt/clone<date> (e.g. /mnt/clone0310)
- Restore actions by user
- Delete clone when done
# mountpoints (maproot=root:wheel)
drwxr-xr-x 2 root root 4096 Mar 10 14:08 /mnt/clone0310
drwxr-xr-x 2 root root 4096 Mar  6 14:01 /zfshomes
# /etc/fstab examples (either private network)
#192.168.102.245:/mnt/tank/zfshomes \
  /zfshomes nfs rw,tcp,soft,intr,bg,vers=3
#10.10.102.245:/mnt/tank/zfshomes \
  /zfshomes nfs rw,tcp,soft,intr,bg,vers=3
#192.168.102.245:/mnt/tank/zfshomes-auto-20200310.1348-1y-clone \
  /mnt/clone0310 nfs ro,tcp,soft,intr,bg,vers=3
10.10.102.245:/mnt/tank/zfshomes-auto-20200310.1348-1y-clone \
  /mnt/clone0310 nfs ro,tcp,soft,intr,bg,vers=3
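Once the clone is mounted read-only, a restore by the user is a plain copy back into the live home directory (paths are illustrative):
# pull a file out of the March 10 clone into the current home
cp -a /mnt/clone0310/hmeij/project/file.dat /zfshomes/hmeij/project/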
Update
Change the Train to 11.3, then you will apply the update first in the WebUI to the passive controller.
After it reboots, you will fail over to it by rebooting the Active controller (the one serving the current WebUI).
This will fail over to the updated 11.3-U2.1 controller (brief interruption).
From there, you would go to System > Update and do the same for the NEW passive controller.
After that initiate failover back to primary via dashboard (brief interruption).
Enable HA, click icon
Apply Pending Updates: upgrades both controllers. Files are downloaded to the Active Controller and then transferred to the Standby Controller. The upgrade process starts concurrently on both TrueNAS controllers.
The server responds while HA is disabled. You are instructed to Initiate Failover; do so, it just takes 5 seconds. Then Continue with pending upgrade ... wait 5 minutes or so and watch console activity. THEN log out and log back in once the passive standby is on the new update.
Update takes 15 mins in total.
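To confirm both controllers ended up on the same version, the running version can be read from the shell on each (assuming ssh access to the controllers by name):
# both should report the same TrueNAS release
ssh hpcstore1 cat /etc/version
ssh hpcstore2 cat /etc/version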
HDD
Two types, hard to find in stock.
8T SAS
da0: <HGST HUS728T8TAL4201 B460> Fixed Direct Access SPC-4 SCSI device
da0: Serial Number VAKM5GTL
da0: 1200.000MB/s transfers
da0: Command Queueing enabled
da0: 7630885MB (1953506646 4096 byte sectors)
exxactcorp https://www.exxactcorp.com/search?q=HUS728T8TAL4201
800G SSD
da2: <WDC WUSTR6480ASS201 B925> Fixed Direct Access SPC-5 SCSI device
da2: Serial Number V6V1XGDA
da2: Command Queueing enabled
da2: 763097MB (1562824368 512 byte sectors)
exxactcorp https://www.exxactcorp.com/search?q=WUSTR6480ASS201
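To map a device name back to its model and serial number when ordering a replacement, the FreeBSD tools on the appliance help (the device name is an example):
# list attached SAS/SATA devices with their model strings
camcontrol devlist
# model and serial number details for one device
smartctl -i /dev/da0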
Logs
From support:
That information is logged via syslog for the opposite controller. For example, to find the information I did here, I looked in the syslog output on the controller that was passive at the time these alerts occurred.
You can look that information up yourself by opening an SSH session to the passive controller, navigating to the /root/syslog directory and examining the files. The “controller_{a,b}” file shows the output for today. Extract the controller_a.0.bz2 file and read the output of the resulting controller_a.0 file to see the output for yesterday. controller_a.1 would contain the output for the day before yesterday, and so on.
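For example, on the passive controller (file names as described above):
# today's log for controller A, then yesterday's compressed log
cd /root/syslog
less controller_a
bzcat controller_a.0.bz2 | less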
