No kidding. Moving from the Open Cluster Stack (OCS 4.1.1) with Red Hat Enterprise Linux AS (RHEL 4.3) to OCS 5.1 with RHEL 5.1 means a ton of changes. We'll also migrate the Load Sharing Facility (LSF) scheduler from version 6.2 to version 7.0.2.
The approach taken is to turn the old ionode server into a new head node, petaltail.wesleyan.edu, grab a couple of compute nodes, and set up and configure a tiny new cluster. Once all the software has been recompiled and the scheduler is up and running, we'll open it up for testing. Then we need to decide how to add the other compute nodes.
Below are some mental notes on all the changes that will impact use of the cluster. Details will be added as we progress.
In the end I anticipate making swallowtail the master LSF node and commercial software license server, so nothing changes for the users. Host petaltail will become a backup LSF master node and will also act as the installer node, that is, the node that manages and images the compute nodes. We can entertain the idea of also making petaltail a compute node (perhaps allocating 4 job slots).
Dear Swallowtail Users,
The new cluster environment is taking shape. Our new head node, our old io-node, is called petaltail.wesleyan.edu and can be treated similarly to swallowtail. A petaltail is a very, very tiny dragonfly found in the local area. Two compute nodes are attached, one on the InfiniBand switch and one on the Ethernet switch. LSF has been upgraded to 7.0. I'm currently running a stress test, so do not be alarmed by the massive number of jobs flying by.
In order to get familiar with this new host please read https://dokuwiki.wesleyan.edu/doku.php?id=cluster:72 (there are more new pages on the main page too).
Most of what will change are paths to software locations. Since that change was forced upon us, an entire overhaul of the software area was undertaken. All software was recompiled against a default compiler unless noted otherwise. Also, all MPI libraries are compiled against a default flavor supporting both the InfiniBand and Ethernet switches.
I'm requesting that each of you run small programs and test the software you are using. Try to duplicate on this new host the results you obtained on swallowtail. Then, a month from now, we can decide to take the irreversible step and ingest the other compute nodes.
In the end swallowtail will be a backup LSF master candidate and hold the commercial software packages.
I can meet with individuals as needed this month to work on problems found. I will publish problems I receive at the bottom of this page so you can check whether a bug you found was already reported.
It is time to plan the migration of the compute nodes from swallowtail to petaltail. This migration is destructive; in other words, nodes will be rebuilt and joined to this new head node running Open Cluster Stack v5.1, Red Hat Enterprise Linux 5.1 and Load Sharing Facility 7.0. More on this topic
I will need at least half a workday to do this, with follow-up testing. Are there any conflicts if I pick a day between June 6th and June 14th? Please email me suggestions directly. The steps of the migration will roughly go like this:
- block ssh access to swallowtail
- halt jobs, halt compute nodes
- reimage nodes using petaltail
- reconfigure LSF queues (&licenses, remove & add)
- open petaltail for business (jobs can be submitted)
- keep swallowtail up for a few days (no access)
- reimage swallowtail as master LSF candidate
- reposition swallowtail as LSF master, petaltail as LSF secondary (and repository manager)
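The steps above can be sketched roughly as a shell session. This is an untested outline, not the actual runbook: host ranges are placeholders and the exact LSF/OCS invocations should be double-checked before the real migration.

```shell
#!/bin/sh
# Rough, hypothetical sketch of the migration steps; all names are placeholders.

badmin qinact all                  # stop dispatching new jobs from all queues
bjobs -u all                       # watch running jobs drain (or bkill stragglers)
pdsh -w compute-[00-35] poweroff   # halt the compute nodes
# ...re-image the nodes from petaltail's installer...
# ...reconfigure LSF queues and license-tied hosts...
badmin qact all                    # open petaltail for business
```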
During the time that swallowtail is locked, and until it is reimaged, Matlab and Stata will not be accessible, hence their queues will be closed (because of license issues).
Between now & then, please test the new cluster environment.
Ah yes, sorry. The package wants to name nodes in the following manner: compute-RR-NN, where RR stands for the rack number and NN is the node number. I dropped the rack identifier. Since I have to assign new IPs to these compute nodes, I started at 192.168.1.100 / 10.3.1.100 and the nodes will be named compute-00, compute-01 … compute-35. And yes, I tried to start at 101 but have no control over where NN starts.
You may also use the following formats: c-NN or cNN. LSF will use the shortest format internally.
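To make the name-to-address mapping concrete, here is a tiny hypothetical helper that reproduces the scheme described above (names compute-00 … compute-35, last IP octet starting at 100):

```shell
#!/bin/sh
# Reproduce the node naming scheme: compute-NN with the last octet at 100 + NN.
gen_nodes() {
  for n in $(seq 0 35); do
    printf 'compute-%02d 192.168.1.%d 10.3.1.%d\n' "$n" $((100 + n)) $((100 + n))
  done
}
gen_nodes
```

The first line printed is `compute-00 192.168.1.100 10.3.1.100` and the last is `compute-35 192.168.1.135 10.3.1.135`.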
On swallowtail, users' home directories were located on the NAS; autofs would then automatically mount them to /home/username when a user logged in. This sometimes became problematic when autofs became confused and would then hang the entire head node.
In the petaltail environment, users' home directories are located directly in /home, making administration much simpler. For our purposes, we'll then mount the NAS-based home directories directly on top of /home. We'll do this from the NAS to each compute node, eliminating autofs configuration changes.
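A minimal sketch of what the direct mount might look like in each node's fstab; the NAS hostname and export path here are invented for illustration, not our actual configuration:

```
# hypothetical /etc/fstab entry on each compute node
nas1.wesleyan.edu:/export/home  /home  nfs  rw,hard,intr  0 0
```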
This is the area where all the software is located. In swallowtail's environment, it was exported from the head node to each compute node. In petaltail's environment I have moved this to /home/apps (with a link provided for backward compatibility), and thus it will be mounted directly from the NAS to the nodes.
All the new software is compiled with Intel's compiler listed above. Unless we are instructed to use gcc, this is the default compiler. For Amber users, the software was compiled with both v9 and v10.
Parallel compilations (Amber and NAMD for example) were performed using OpenMPI only. This enables these programs to run both on the Infiniband and gigabit Ethernet switches. Other MPI flavors are available if needed, please consult the Software Page for more information.
Topspin is deprecated. However, I have copied over this MPI flavor in case users cannot recompile against OpenMPI.
IP traffic over the IB device. This basically lets a parallel program that was compiled for Ethernet switches (TCP/IP) run across a switch optimized for parallel programs, such as our InfiniBand switch. This is done by assigning an IP address to the ib0 interface. The TCP/IP packets are encapsulated inside IB packets and shipped over the switch. This would, of course, add traffic to the InfiniBand switch if used. Configuring it requires a re-imaging, so I decided to add this functionality now.
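For reference, assigning an IP to the ib0 interface on RHEL amounts to a config file like the following; the address and netmask are invented here purely for illustration:

```
# /etc/sysconfig/network-scripts/ifcfg-ib0  (illustrative values only)
DEVICE=ib0
ONBOOT=yes
BOOTPROTO=static
IPADDR=10.5.1.100
NETMASK=255.255.0.0
```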
It is important that all software gets tested before we make the final switch.
Here is the list: Internal Link
Commercial software, because of license tie-ins with hostname, will be dealt with after the switch.
Use /home/apps/bin/lsf.openmpi.wrapper, which is a link to /opt/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper. You can find more wrapper scripts in that location.
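A hedged example of submitting an OpenMPI job through that wrapper; the queue name, slot count and program name below are placeholders, not a prescribed recipe:

```shell
# Submit a 16-way OpenMPI job via the LSF wrapper script.
# "normal" and ./my_mpi_prog are placeholders for your queue and binary.
bsub -n 16 -q normal -o out.%J \
    /home/apps/bin/lsf.openmpi.wrapper ./my_mpi_prog
```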
There do not appear to be significant changes between LSF 6.2 and 7.0.
Documentation can be found at this location: External Link
The Clumon Dashboard is gone. This is too bad, as the graphical display was quite useful. I will attempt to either:
For those who are interested, here are some notes on what changed related to system administration. Basically, everything changed!
The installer node, the one that manages the operating systems for imaging the compute nodes, can now be a standalone host. This is probably handy when you have dozens of operating systems and versions to maintain.
Also very handy. Using Kusu templates, one can customize every type of nodegroup desired. For example, in our case all compute nodes have dual NICs so we create a template and then clone compute nodes from it. The head node has 4 NICs, so we create a template for it and can now easily clone the head node. Etc.
We are deploying an RPM-based installation using Red Hat Enterprise Linux 5.1. Basically what this means is that the head node contains all Red Hat packages in a depot managed by repoman. Want a package installed for a certain nodegroup? Simply use Kusu and you're done.
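As a sketch of that workflow (the option letter is an assumption from memory; check each tool's own help before relying on it):

```shell
# Option flags here are assumptions, not verified against OCS 5.1.
repoman -l    # list the repositories held in the depot
ngedit        # pick packages for a nodegroup, then let Kusu sync the nodes
```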
In an RPM-based installation, upon re-imaging, each package needs to be retrieved and installed. That takes roughly 5 minutes per node; again, not a big deal for us. Disk-imaged means that only an image is transferred, et voilà, the node is up.
Diskless might be interesting to investigate. Upon re-imaging, the entire operating system image is stored in memory. No more failing disks, less power consumption, smaller node footprints. Sort of a virtualized compute cluster. The downside is that memory must be sufficient for both jobs and the OS. But memory is cheap.
pdsh, parallel distributed shell
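For example, running one command across all compute nodes in a single shot (the host range syntax relies on pdsh's hostlist support; the command itself is just an illustration):

```shell
# Check load on every compute node at once.
pdsh -w compute-[00-35] uptime
```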
CFM, Cluster File Management, is a huge new product providing lots of flexibility in managing files across the nodes. It basically creates a directory for each nodegroup. Inside that directory, links are made to, for example, /etc/hosts. If that link exists in the nodegroup, cfmsync then pushes that file to each node that is a member of that nodegroup. Or you can create real files like fstab.append, and the content of that file is then appended to the original file upon syncing.
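A sketch of that workflow, assuming a hypothetical nodegroup directory layout; the paths, the NAS export and the cfmsync flag below are all assumptions, so consult the CFM documentation for the real invocation:

```shell
# All paths and flags here are illustrative, not our actual CFM tree.
cd /etc/cfm/compute-nodegroup          # hypothetical nodegroup directory
mkdir -p etc
ln -s /etc/hosts etc/hosts             # push /etc/hosts verbatim to members
echo 'nas1:/export/apps /home/apps nfs defaults 0 0' > etc/fstab.append
cfmsync -f                             # propagate the changes (flag assumed)
```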
I have not implemented this yet. There are a couple of places where users need to provide passwords without SSL. So until those are SSL-enabled, it might be better not to provide your Wesleyan AD password.
A very nice web-based OCS console that ties together OCS administration (maintenance of the depot, networks, nodegroups, etc.), the LSF scheduler, and the Cacti and Nagios alert and monitoring tools. Admins only, at the Platform Management Console.
…sendmail not configured correctly to hand off email to mail-int.wesleyan.edu, fixed.