These topics came out of our Design Conference with Dell; read about that on this page.
|As Detailed In Cluster Quote|
|Hard Drive:||80GB, SATA, 3.5-inch 7.2K RPM Hard Drive|
|Additional Storage Products:||80GB, SATA, 3.5-inch 7.2K RPM Hard Drive|
The Dell folks appeared surprised that, in the quote, the PowerEdge 1950 blades for all the compute nodes were outfitted with dual disks (see table above). Only the four heavy weight nodes have a single disk. The dual disks in the PowerEdge 2950s (head and io nodes) will of course be striped and mirrored.
On the light weight compute nodes we could thus stripe & mirror the disks so that failure becomes less of an issue. However, and Dell folks agreed, not much would be gained. If a compute node fails, shoot-node would simply re-install the operating system from scratch on the compute node.
Why leave these disks idle? The proposal to mount the disks as /localscratch seemed ok. If users want to use this space it is entirely up to them. It'll be slower than the SAN based /sanscratch, but maybe it makes sense for certain programs. For example, programs could checkpoint into the local space as needed. It's also important to point out that this file system could be useful for any file locking operations since it is local.
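As a rough illustration of the checkpointing and file locking idea, here is a minimal Python sketch. It assumes a program is free to pick its own scratch directory (for example somewhere under /localscratch); the function names and the JSON format are hypothetical choices for the example, not part of the cluster design.

```python
import fcntl
import json
import os
import tempfile


def write_checkpoint(state, scratch_dir, name="checkpoint.json"):
    """Write `state` as JSON into scratch_dir and return the final path.

    An exclusive flock() on a side file serializes writers on this node,
    and the temp-file-plus-rename step keeps readers from ever seeing a
    half-written checkpoint.
    """
    os.makedirs(scratch_dir, exist_ok=True)
    final_path = os.path.join(scratch_dir, name)
    lock_path = final_path + ".lock"
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            fd, tmp_path = tempfile.mkstemp(dir=scratch_dir, suffix=".tmp")
            with os.fdopen(fd, "w") as tmp:
                json.dump(state, tmp)
                tmp.flush()
                os.fsync(tmp.fileno())  # make sure the data reached the disk
            os.replace(tmp_path, final_path)  # atomic on a local file system
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return final_path


def read_checkpoint(scratch_dir, name="checkpoint.json"):
    """Return the last saved state, or None if no checkpoint exists yet."""
    path = os.path.join(scratch_dir, name)
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```

A job would call write_checkpoint(state, "/localscratch/<some job directory>") every so often and read_checkpoint(...) on restart. Keep in mind that a checkpoint on a local disk is lost if the node is re-installed.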
So the idea was basically a go.
/localscratch on the heavy weight nodes is of course the MD1000 high performance (dedicated) scratch space.
The design calls for a 10 TB file system provided by the SAN: Fibre Channel to the io node, then NFS exported to all compute nodes. The idea is to create 2 volumes with different properties (e.g. snapshots). The first volume will carry the SAN based (LUN) scratch space. The second volume will carry the home directories and thus needs a backup policy.
This scratch space, /sanscratch, is shared amongst the compute nodes. Thus each node could use whatever is available. My inclination is that 1 TB seems excessive but with 'thin provisioning' and 'autogrow' we can monitor usage.
I NEED TO CHECK THIS WITH MATT (thanks matt — Henk Meij 2007/01/29 11:39)
|Matt & I just realized that if it were required, an fsck (check and repair file system) operation on a 10 TB file system could take hours/days/weeks? We need to assess this. Click here to read about a suggested approach.|
Matt here - with thin provisioning you can make the space to the OS appear to be as arbitrarily large as you would like, without it actually taking up that space. So in theory you could make the scratch space 10TB but make it thin provisioned so that in reality it would only take up what was on it. Then that could be monitored with expansions happening as needed - as far as the OS would know, it would be fine… unless the volume/lun ran out of space to write, in which case the io node could crash or simply offline the LUN - I've not seen the former, only the latter.
If you wanted to keep to the policy of making a small LUN which would then be expanded accordingly, you would need downtime of that file system for each expansion, but it would otherwise be quite workable.
Another possibility would be to make the LUN appear to be 10TB, but then partition off and put a file system on just a small portion of that, and increase that as necessary; this would mean you'd have an OS level quota control (of sorts) going on to keep yourself sane. This resizing can be done live, though perhaps only through LVM; I've not tried doing it on a raw ext3 partition. Since RHEL4 by default wants you to use LVM for everything, this may be a complete non-issue.
— Matthew Elson 2007/01/29 09:14
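The monitoring mentioned in this discussion could start out as simple as the Python sketch below. It only sees what the operating system reports for a mounted file system, so it fits the "small file system that we grow as needed" option. With thin provisioning, the space actually consumed on the SAN has to be checked with the SAN's own tools, which are not shown here. The 80% threshold and the function names are assumptions made for the example.

```python
import shutil


def usage_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the file system holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo file systems report a size of zero.
        return 0.0
    return usage.used / usage.total


def needs_expansion(path, threshold=0.80):
    """True when usage has reached the (assumed) threshold for growing the volume."""
    return usage_fraction(path) >= threshold
```

Run from cron on the io node, a check like needs_expansion("/sanscratch") could mail the administrators well before the file system fills up.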
This is a big question. More on this later. My gut reaction is, let's have Dell configure the cluster per the design as it is now. If we feel at some later date that we can bypass the io node and NFS directly with better performance, we can change the configuration. In fact, we can perform tests and then take action.
We can also remain flexible. Some compute nodes could be directly mounted to the SAN for NFS, some not. Also, if we entirely bypassed the io node in the future, we would essentially have a free compute node (for all debug queues?). Or we could treat it as a “backup head node”. Or we could have 2 front end nodes if our users heavily load the single head node. One could be for compiling activities, and one could be for batch scheduling.
I think my preference is to let Dell do the io node installation per the current design. Once we have some usage statistics and experience, we would have all the information for a redesign. BTW: mounting from SAN directly would require a dedicated interface on the SAN.
If we added a second private network from the very start, we could incorporate that into the design. Some configuration will be necessary so that the batch scheduler does not get confused between the multiple interfaces on each host. I'm pressing hard for this as the Platform/OCS documentation suggests a 30% bandwidth improvement to the compute nodes. And the cost is minimal. A Dell 2768 switch is listed at around $800. In addition we would need a third NIC for the head node (another $100). And some cabling. Why not, I ask?
The NFS traffic would flow over the premier Cisco ethernet switch with redundant power supplies etc. The administrative traffic would flow over this cheaper Dell ethernet switch. It could fail. When it does, we could either a) have a duplicate on hand and switch it out, b) put some support contract on it, or c) reconfigure the cluster to use the Cisco ethernet switch for all traffic (I envision changes to /etc/fstab, pushing that out with mount/remount activities).
For $1,000 … let's just do this. In fact, we would gain a ton of flexibility.
This Roll can be installed by Dell. Licenses would need to be obtained for the components required. Basically we need to choose here whether we go with the Portland Compilers or the Intel Compilers Roll. The Intel Software Roll contains the Intel Compilers and the Intel Cluster Tools.
The general consensus appears to be that there is no preference. The Gaussian code users use the Portland Compiler, but on their own computers. So perhaps the best approach is to install the Portland Compiler evaluation license first, and let users do some testing. Following that period, we could install the Intel Compiler evaluation license for comparative testing. Then we make a decision. — Henk Meij 2007/01/31 09:27
The focus of this issue is somewhat unrelated to the design of the cluster. But we need to avoid creating a future headache. User accounts are created on the head node using the useradd command. Then the /etc/passwd, /etc/shadow and /etc/group files are pushed to the compute nodes using 411. That enables users to compile programs on the head node, and then execute those programs on the compute nodes.
However, simply running useradd will grab a UID/GID locally. Now think about that. A 10 TB file system, filled with locally generated ids … and then the users transfer files back and forth between other Wesleyan servers. So if we're smart, we need to wrap useradd with a script that consults our Active Directory (AD), obtains the UID/GID for the specific user to be created, and feeds that to the account creation process. Voila. User and group permission heaven.
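A minimal Python sketch of such a wrapper might look like this. The AD query itself is deliberately left out: lookup_ids is a hypothetical callable that returns a (uid, gid) pair for a username, or None if the user is unknown, and how it talks to AD is an open assumption. Only the standard useradd options -u, -g and -m are used, and the group behind the GID is assumed to exist already on the head node.

```python
import subprocess


def build_useradd_command(username, uid, gid):
    """Build a useradd command line that pins the UID and GID explicitly."""
    if uid < 0 or gid < 0:
        raise ValueError("UID and GID must be non-negative")
    # -u sets the UID, -g the primary group, -m creates the home directory.
    return ["useradd", "-u", str(uid), "-g", str(gid), "-m", username]


def create_account(username, lookup_ids, dry_run=True):
    """Create `username` with the ids returned by the (hypothetical) lookup.

    With dry_run=True the command is only returned, not executed.
    """
    ids = lookup_ids(username)
    if ids is None:
        raise LookupError("no directory entry found for " + username)
    uid, gid = ids
    command = build_useradd_command(username, uid, gid)
    if not dry_run:
        subprocess.run(command, check=True)  # needs root on the head node
    return command
```

After the account exists on the head node, the password, shadow and group files would still be pushed to the compute nodes with 411 as described above.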
That way we also implicitly set up some security amongst our users, meaning all users are matched with their departmental organizational units (BIOL, PHYS, ECON etc). In addition, we will have verified that the cluster users are part of our Wesleyan domain.
So for guest accounts, we can create local entries. We *could* also use AD, depends on how many there will be. But we need some way of tracking these accounts. One proposal would be this: user email@example.com, a serious researcher studying the implications of stochastically processing email, wants to collaborate with three other buddies.
melson@swallowtail already has an AD assigned UID/GID. Next we create Xmelson, Ymelson and Zmelson locally on the cluster. The UIDs and GIDs for these accounts are locally defined, let's say in the range 100-199. The user melson is added to these groups if file sharing is needed.
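The bookkeeping for these local guest accounts could be sketched as follows in Python. The range 100-199 and the X/Y/Z prefixes come from the proposal; the function names are hypothetical, and the list of ids already in use is assumed to be supplied by the caller (for example from the local passwd and group files).

```python
def next_free_id(used_ids, low=100, high=199):
    """Return the lowest id in [low, high] that is not yet in use."""
    used = set(used_ids)
    for candidate in range(low, high + 1):
        if candidate not in used:
            return candidate
    raise RuntimeError("guest id range %d-%d is exhausted" % (low, high))


def guest_names(owner, prefixes=("X", "Y", "Z")):
    """Derive guest account names from the sponsoring user's name."""
    return [prefix + owner for prefix in prefixes]
```

On the head node, the ids already taken could be collected with the standard pwd module, for instance [entry.pw_uid for entry in pwd.getpwall()].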
This would allow us to: