\\ **[[cluster:0|Back]]** This is for experimental purposes only. \\ Proof of concept type of a thing. \\ --- //[[hmeij@wesleyan.edu|Henk Meij]] 2007/09/28 11:38// ====== The Story Of NAT ====== The cluster is served file systems from our **[[cluster:34#netapp_fas_3050c|NetApp Fabric Attached Storage Device]]**. These file systems are NFS mounted on each compute node via the IO node. The NFS traffic is isolated to one of our private networks on the cluster, the 10.3.1.xxx subnet, running across a Cisco 7000 gigabit ethernet switch. So what happens when you have another file system that you would like to make available on the back end compute nodes? From another cluster for example. One approach is to rely on network address translation (NAT). That approach is described here so i don't forget what we did. Note that: * i'm not endorsing this approach at the current time until we test it further * any "opening up" of the private environment of the cluster introduces security risks * any "non-cluster" activities the compute nodes are involved in, potentially compromises their performance * i had no idea how this worked until Scott Knauert put it together We start by grabbing a surplus computer and install linux on it\\ We add two NIC cards (in our case capable of 100e not gigE)\\ We run a CAT6 cable from a router port to the cluster (this is gigE)\\ And we named this new host **''NAT''**. [root@NAT:~]uname -a Linux NAT 2.6.18-5-686 #1 SMP Fri Jun 1 00:47:00 UTC 2007 i686 GNU/Linux ===== Interfaces ===== The NAT box will have two interface. One on our internal VLAN 1, just like our head node ''swallowtail.wesleyan.edu''. The other one will be on the NFS private network of the cluster. So basically we have: * eth1: 129.133.1.225 * eth2: 10.3.1.10 This is defined in (Debian) ''/etc/network/interfaces'' (below). We choose VLAN 1 since we need to reach a file system hosted by ''vishnu.phy.wesleyan.edu'' in VLAN 90. # This file describes the network interfaces available on your system # and how to activate them. For more information, see interfaces(5). # The loopback network interface auto lo iface lo inet loopback # Wesleyan auto eth1 iface eth1 inet static address 129.133.1.225 netmask 255.255.255.0 gateway 129.133.1.1 # Cluster auto eth2 iface eth2 inet static address 10.3.1.10 netmask 255.255.255.0 ===== ipTables ===== Since we are opening up the backend of the cluster's private network, we need to clamp down on the access as much as possible on the NAT box. IP table chains limit the traffic from and to the private NFS network 10.3.1.xxx and the target host 129.133.90.207. But the whole intent of the NAT host is to provide a bridge between separate networks. So any packets that need to traverse this bridge are postrouted or forwarded across. * file ''/etc/init.d/nat'' #!/bin/bash #EXTERNAL is the interface to the outside network. EXTERNAL="eth1" #INTERNAL is the interface to the local network. INTERNAL="eth2" /sbin/depmod -a /sbin/modprobe ip_tables /sbin/modprobe iptable_nat iptables --flush iptables --table nat --flush iptables --delete-chain iptables --table nat --delete-chain # added source and destination -hmeij iptables --table nat --source 10.3.1.0/24 --destination 129.133.90.207 \ --append POSTROUTING --out-interface $EXTERNAL -j MASQUERADE iptables --source 129.133.90.207 --destination 10.3.1.0/24 \ --append FORWARD --in-interface $INTERNAL -j ACCEPT echo "1" > /proc/sys/net/ipv4/ip_forward We can now test the setup by contacting the remote host and attempt to mount the remote file system: [root@NAT:~]# ping vishnu.phys.wesleyan.edu PING vishnu.phys.wesleyan.edu (129.133.90.207) 56(84) bytes of data. 64 bytes from vishnu.phys.wesleyan.edu (129.133.90.207): icmp_seq=0 ttl=63 time=0.235 ms 64 bytes from vishnu.phys.wesleyan.edu (129.133.90.207): icmp_seq=1 ttl=63 time=0.193 ms 64 bytes from vishnu.phys.wesleyan.edu (129.133.90.207): icmp_seq=2 ttl=63 time=0.115 ms --- vishnu.phys.wesleyan.edu ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2001ms rtt min/avg/max/mdev = 0.115/0.181/0.235/0.049 ms, pipe 2 [root@NAT:~]# mount vishnu.phys.wesleyan.edu:/raid/home /mnt [root@NAT:~]# df -h /mnt Filesystem Size Used Avail Use% Mounted on vishnu.phys.wesleyan.edu:/raid/home 4.6T 1.5T 3.2T 31% /mnt [root@NAT:~]# umount /mnt ===== Routes ===== On the compute nodes we now need to change the routing of the packets. Platform/OCS had already defined a default gateway that pointed back to swallowtail_nfs (10.3.1.254). We now subsitute the NAT box private IP (10.3.1.10) for the default gateway. In addition, Platform Support wants to make sure a gateway is defined for the other private network (192.168) so that any ssh callbacks can get resolved. The commands are (added to ''/etc/rc.local''): # add for nat box on administrative network route add -net 192.168.1.0 netmask 255.255.255.0 gw 192.168.1.254 dev eth0 # change default route set by platform/ocs route add -net default netmask 0.0.0.0 gw 10.3.1.10 dev eth1 route del -net default netmask 0.0.0.0 gw 10.3.1.254 dev eth1 and now our routing tables on the compute node looks like this: [root@compute-1-1 ~]# route Kernel IP routing table Destination Gateway Genmask Flags Metric Ref Use Iface 255.255.255.255 * 255.255.255.255 UH 0 0 0 eth0 192.168.1.0 swallowtail.loc 255.255.255.0 UG 0 0 0 eth0 192.168.1.0 * 255.255.255.0 U 0 0 0 eth0 10.3.1.0 * 255.255.255.0 U 0 0 0 eth1 169.254.0.0 * 255.255.0.0 U 0 0 0 eth1 224.0.0.0 * 240.0.0.0 U 0 0 0 eth0 default 10.3.1.10 0.0.0.0 UG 0 0 0 eth1 We should now be able to ''ping'' and ''mount'' the remote host and file system like we did on the NAT box. ONce we have established connectivity we can redefine the home directory for certain users. ===== AutoFS ===== The whole point of the NAT box is to make the remote home directories available to certain users. For this to work, the remote host must use the same UID/GID settings as the cluster does. The cluster uses the UID/GID settings from Active Directory (AD). Once the UID/GID settigns are synced, we change the home directory location for a user in question. After those changes, that file needs to be pushed out to the compute nodes and autofs restarted. [root@swallowtail ~]# egrep 'hmeij|sknauert' /etc/auto.home hmeij localhost:/export/home/users/hmeij sknauert vishnu.phys.wesleyan.edu:/raid/home/templarapheonix [root@swallowtail ~]# make -C /var/411 [root@swallowtail ~]# /etc/init.d/autofs restart Once autofs is reatrted on both the head node and the compute node compute-1-1, we can force the automount to mount the remote home directory [root@compute-1-1 ~]# cd ~sknauert [root@compute-1-1 sknauert]# df -h . Filesystem Size Used Avail Use% Mounted on vishnu.phys.wesleyan.edu:/raid/home/templarapheonix 4.6T 1.5T 3.2T 31% /home/sknauert ===== Tests ===== So lets write some files on compute node compute-1-1 in the remotely mounted home directory. Meaning, the packets have to flow over NFS to the NAT box which forwads the packets to the remote host. For comparison, lets do the same by writing to some other file systems. #for i in 1024 10240 102400 1024000; do echo $i; time dd if=/dev/zero of=./out.$i bs=1k count=$i; done # ls -lh -rw-r--r-- 1 sknauert s07 1M Sep 28 10:28 out.1024 -rw-r--r-- 1 sknauert s07 10M Sep 28 10:28 out.10240 -rw-r--r-- 1 sknauert s07 100M Sep 28 10:29 out.102400 -rw-r--r-- 1 sknauert s07 1000M Sep 28 10:41 out.1024000 ^ Where ^ 1024 ^ 10240 ^ 102400 ^ 1024000 ^ |vishnu.phys:/home/sknauert/TEMP|0m0.868s|0m6.404s|1m12.993s|11m59.465s| |/export/home/rusers/sknauert/TEMP|0m0.027s|0m0.160s|0m01.815s|00m20.484s| |/sanscratch/TEMP|0m0.108s|0m0.452s|0m01.626s|00m27.664s| |/localscratch/TEMP|0m0.005s|0m0.038s|0m00.370s|00m07.687s| These time recordings will wildly vary depending on competing resources ofcourse. The bottom three file systems are NFS mounted on compute node ''compute-1-1'' from the ''io-node'' over gigabit ethernet. From the ''io-node'' the connection is 4 gigabits/sec fiber channel to the NetApp filer. The connection in our test setup is limited by the 100e NIC cards in the NAT box. Also the remote host has a 100e link to VLAN 90. We should move these to gigE. ===== Errors ===== ''dmesg'' shows ... which puzzle me ... eth1: Transmit error, Tx status register 82. Probably a duplex mismatch. See Documentation/networking/vortex.txt Flags; bus-master1, dirty 887391(15) current 887391(15) Transmit list 00000000 vs. c7c30b60. 0: @c7c30200 length 800005ea status 000105ea ... \\ **[[cluster:0|Back]]**