\\
**[[cluster:0|Back]]**

This is for experimental purposes only. \\
Proof of concept type of a thing. \\
 --- //[[hmeij@wesleyan.edu|Henk Meij]] 2007/09/28 11:38//


====== The Story Of NAT ======

The cluster is served file systems from our **[[cluster:34#netapp_fas_3050c|NetApp Fabric Attached Storage Device]]**.  These file systems are NFS mounted on each compute node via the IO node.  The NFS traffic is isolated to one of our private networks on the cluster, the 10.3.1.xxx subnet, running across a Cisco 7000 gigabit ethernet switch.

So what happens when you have another file system that you would like to make available on the back end compute nodes?  From another cluster for example.  One approach is to rely on network address translation (NAT).  That approach is described here so i don't forget what we did.

Note that:

  * i'm not endorsing this approach at the current time until we test it further
  * any "opening up" of the private environment of the cluster introduces security risks
  * any "non-cluster" activities the compute nodes are involved in, potentially compromises their performance
  * i had no idea how this worked until Scott Knauert put it together

We start by grabbing a surplus computer and install linux on it\\
We add two NIC cards (in our case capable of 100e not gigE)\\
We run a CAT6 cable from a router port to the cluster (this is gigE)\\
And we named this new host **''NAT''**.

<code>
[root@NAT:~]uname -a
Linux NAT 2.6.18-5-686 #1 SMP Fri Jun 1 00:47:00 UTC 2007 i686 GNU/Linux
</code>


===== Interfaces =====

The NAT box will have two interface.  One on our internal VLAN 1, just like our head node ''swallowtail.wesleyan.edu''.  The other one will be on the NFS private network of the cluster.  So basically we have:

  * eth1: 129.133.1.225 
  * eth2: 10.3.1.10

This is defined in (Debian) ''/etc/network/interfaces'' (below).  We choose VLAN 1 since we need to reach a file system hosted by ''vishnu.phy.wesleyan.edu'' in VLAN 90.

<code>
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback

# Wesleyan
auto eth1
iface eth1 inet static
address 129.133.1.225
netmask 255.255.255.0
gateway 129.133.1.1

# Cluster
auto eth2
iface eth2 inet static
address 10.3.1.10
netmask 255.255.255.0
</code>


===== ipTables =====

Since we are opening up the backend of the cluster's private network, we need to clamp down on the access as much as possible on the NAT box.  IP table chains limit the traffic from and to the private NFS network 10.3.1.xxx and the target host 129.133.90.207.

But the whole intent of the NAT host is to provide a bridge between separate networks.  So any packets that need to traverse this bridge are postrouted or forwarded across.  

  * file ''/etc/init.d/nat''

<code>
#!/bin/bash

#EXTERNAL is the interface to the outside network.
EXTERNAL="eth1"
#INTERNAL is the interface to the local network.
INTERNAL="eth2"

/sbin/depmod -a
/sbin/modprobe ip_tables
/sbin/modprobe iptable_nat
iptables --flush
iptables --table nat --flush
iptables --delete-chain
iptables --table nat --delete-chain

# added source and destination -hmeij
iptables --table nat --source 10.3.1.0/24 --destination 129.133.90.207 \
         --append POSTROUTING --out-interface $EXTERNAL -j MASQUERADE
iptables --source 129.133.90.207 --destination 10.3.1.0/24 \
         --append FORWARD --in-interface $INTERNAL -j ACCEPT

echo "1" > /proc/sys/net/ipv4/ip_forward
</code>

We can now test the setup by contacting the remote host and attempt to mount the remote file system:

<code>
[root@NAT:~]# ping vishnu.phys.wesleyan.edu
PING vishnu.phys.wesleyan.edu (129.133.90.207) 56(84) bytes of data.
64 bytes from vishnu.phys.wesleyan.edu (129.133.90.207): icmp_seq=0 ttl=63 time=0.235 ms
64 bytes from vishnu.phys.wesleyan.edu (129.133.90.207): icmp_seq=1 ttl=63 time=0.193 ms
64 bytes from vishnu.phys.wesleyan.edu (129.133.90.207): icmp_seq=2 ttl=63 time=0.115 ms

--- vishnu.phys.wesleyan.edu ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 0.115/0.181/0.235/0.049 ms, pipe 2

[root@NAT:~]# mount vishnu.phys.wesleyan.edu:/raid/home /mnt
[root@NAT:~]# df -h /mnt
Filesystem            Size  Used Avail Use% Mounted on
vishnu.phys.wesleyan.edu:/raid/home
                      4.6T  1.5T  3.2T  31% /mnt
[root@NAT:~]# umount /mnt
</code>


===== Routes =====

On the compute nodes we now need to change the routing of the packets.  Platform/OCS had already defined a default gateway that pointed back to swallowtail_nfs (10.3.1.254).  We now subsitute the NAT box private IP (10.3.1.10) for the default gateway.  In addition, Platform Support wants to make sure a gateway is defined for the other private network (192.168) so that any ssh callbacks can get resolved. The commands are (added to ''/etc/rc.local''):

<code>
# add for nat box on administrative network
route add -net 192.168.1.0 netmask 255.255.255.0 gw 192.168.1.254 dev eth0
# change default route set by platform/ocs
route add -net default netmask 0.0.0.0 gw 10.3.1.10  dev eth1
route del -net default netmask 0.0.0.0 gw 10.3.1.254 dev eth1
</code>

and now our routing tables on the compute node looks like this:

<code>
[root@compute-1-1 ~]# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
255.255.255.255 *               255.255.255.255 UH    0      0        0 eth0
192.168.1.0     swallowtail.loc 255.255.255.0   UG    0      0        0 eth0
192.168.1.0     *               255.255.255.0   U     0      0        0 eth0
10.3.1.0        *               255.255.255.0   U     0      0        0 eth1
169.254.0.0     *               255.255.0.0     U     0      0        0 eth1
224.0.0.0       *               240.0.0.0       U     0      0        0 eth0
default         10.3.1.10       0.0.0.0         UG    0      0        0 eth1
</code>

We should now be able to ''ping'' and ''mount'' the remote host and file system like we did on the NAT box.  ONce we have established connectivity we can redefine the home directory for certain users.


===== AutoFS =====

The whole point of the NAT box is to make the remote home directories available to certain users.  For this to work, the remote host must use the same UID/GID settings as the cluster does.  The cluster uses the UID/GID settings from Active Directory (AD).  Once the UID/GID settigns are synced, we change the home directory location for a user in question.  After those changes, that file needs to be pushed out to the compute nodes and autofs restarted.

<code>

[root@swallowtail ~]# egrep 'hmeij|sknauert' /etc/auto.home

hmeij localhost:/export/home/users/hmeij
sknauert vishnu.phys.wesleyan.edu:/raid/home/templarapheonix

[root@swallowtail ~]# make -C /var/411
[root@swallowtail ~]# /etc/init.d/autofs restart

</code>

Once autofs is reatrted on both the head node and the compute node compute-1-1, we can force the automount to mount the remote home directory
<code>
[root@compute-1-1 ~]# cd ~sknauert
[root@compute-1-1 sknauert]# df -h .
Filesystem            Size  Used Avail Use% Mounted on
vishnu.phys.wesleyan.edu:/raid/home/templarapheonix
                      4.6T  1.5T  3.2T  31% /home/sknauert
</code>


===== Tests =====

So lets write some files on compute node compute-1-1 in the remotely mounted home directory.  Meaning, the packets have to flow over NFS to the NAT box which forwads the packets to the remote host.  For comparison, lets do the same by writing to some other file systems.

<code>
#for i in 1024 10240 102400 1024000; do echo $i; time dd if=/dev/zero of=./out.$i  bs=1k count=$i; done
# ls -lh
-rw-r--r--  1 sknauert  s07          1M Sep 28 10:28 out.1024
-rw-r--r--  1 sknauert  s07         10M Sep 28 10:28 out.10240
-rw-r--r--  1 sknauert  s07        100M Sep 28 10:29 out.102400
-rw-r--r--  1 sknauert  s07       1000M Sep 28 10:41 out.1024000
</code>

^ Where ^ 1024 ^ 10240 ^ 102400 ^ 1024000 ^
|vishnu.phys:/home/sknauert/TEMP|0m0.868s|0m6.404s|1m12.993s|11m59.465s|
|/export/home/rusers/sknauert/TEMP|0m0.027s|0m0.160s|0m01.815s|00m20.484s|
|/sanscratch/TEMP|0m0.108s|0m0.452s|0m01.626s|00m27.664s|
|/localscratch/TEMP|0m0.005s|0m0.038s|0m00.370s|00m07.687s|

These time recordings will wildly vary depending on competing resources ofcourse.  The bottom three file systems are NFS mounted on compute node ''compute-1-1'' from the ''io-node'' over gigabit ethernet.  From the ''io-node'' the connection is 4 gigabits/sec fiber channel to the NetApp filer.

The connection in our test setup is limited by the 100e NIC cards in the NAT box.  Also the remote host has a 100e link to VLAN 90.  We should move these to gigE.


===== Errors =====

''dmesg'' shows ... which puzzle me ...

<code>
eth1: Transmit error, Tx status register 82.
Probably a duplex mismatch. See Documentation/networking/vortex.txt
Flags; bus-master1, dirty 887391(15) current 887391(15)
Transmit list 00000000 vs. c7c30b60. 
0: @c7c30200 length 800005ea status 000105ea 
...
</code>

\\
**[[cluster:0|Back]]**