Why? Here is my summary of some items I wish to take advantage of: Link
We're running Platform/OCS, which includes the Lava scheduler. It's essentially LSF with some functionality removed; however, it is free and works very well. Our Dell cluster came pre-configured with Lava, but it's time to leverage the resources of our cluster more fully.
What version to upgrade to? I thought that would be easy: the latest stable version, which is LSF v7.0.1. But that has complications. Our Platform/OCS cluster is a “Rocks”-based installation. For more reading: Link
Our OCS version is 4.1.1, and the only “roll” available for LSF/HPC is the 6.2 version. That's fine, because all the functionality I'm after exists in this version.
In order to install a v7 “roll”, we'd have to upgrade to OCS 4.4, and that means a huge disruption: all compute nodes would have to be re-imaged and re-deployed. Hence my preference to upgrade via the roll method to v6.2, as advised by Platform support.
Another option is to perform a manual install of v7 from source. However, this would involve significant work, as the scheduler software is installed locally on each compute node and the head node in the OCS environment. I'd like to keep it that way to reduce NFS traffic, rather than install the scheduler software on a shared filesystem.
#0 Shut off NAT box
Reset root password, shut off box.
#1a Close the head node to all ssh traffic (firewall to trusted user VLAN access only).
#1b Inactivate all queues, back up scratch dirs, stop all jobs.
Save the output of bjobs -u all so users can view what was running … info is here External Link
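One way to take that snapshot; the archive path is an assumption (the same destination as the /opt/lava backup below), and the command is printed for review rather than executed:

```shell
# Snapshot the running-job list before draining the queues, so users
# can later see what was killed. Archive location is hypothetical.
archive=/mnt/cluster/opt_lava_pre_lsf
snapshot="$archive/bjobs_all_$(date +%Y%m%d).txt"
echo "bjobs -u all -w > $snapshot"   # run on the head node as an admin
```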
#1c Back up all files needed to rebuild the io-node.
The io-node is currently a compute node (but not a member of any queue, and admin_closed). It has two fiber channel cards connected to the NetApp storage device. A re-image implies rebuilding the NFS environment and reconfiguring dm-multipath, so the entire cluster needs to go idle for that. Let's rebuild it so we can use it, if needed, as a compute node or a backup/failover head node. Rebuilding is documented on https://itsdoku.wesleyan.edu.
#1d Back up all files needed to rebuild the compute nodes.
This includes two varieties of nodes: a sample light weight node and a sample heavy weight node. Some of this should be customized with extend-compute.xml (minor changes for now …). Rebuilding is documented on https://itsdoku.wesleyan.edu.
#1e Stop the lava system across the cluster.
#1f Back up all files in /opt/lava.
⇒ copy the /opt/lava/work/lava/logdir/lsb.acct* files to the archive.
LSF/HPC will install in /opt/lsfhpc, but make sure you have a remote backup copy of /opt/lava … rsync to /mnt/cluster/opt_lava_pre_lsf.
→ Disable Tivoli agents and start a manual incremental backup.
#1g Unmount all io-node exported file systems, leave nodes running.
We'll force a reboot followed by a re-image later, in staggered fashion, after we are done with the LSF install. But first, unmount the NFS file systems with cluster-fork on all compute nodes & the head node.
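A sketch of that pass; cluster-fork is the Rocks utility that runs a command on every compute node, and the commands are printed here for review rather than executed:

```shell
# Print the unmount commands for review before running them by hand.
# cluster-fork runs its argument on every compute node in a Rocks/OCS cluster.
plan="cluster-fork 'umount -a -t nfs'"
echo "$plan"              # all compute nodes
echo "umount -a -t nfs"   # then the head node itself
```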
#1h Good time to clean all orphaned jobs' working dirs in /localscratch & /sanscratch!
→ fix: set this LUN to space reservation enabled (1TB)
#1i Unmount all multipathed LUN filesystems on io-node (/dev/mapper).
⇒ <hi #ff0000> AFTER THAT DISCONNECT THE FIBER CABLES </hi>
Node re-imaging involves formatting and partitioning. We do not want to risk losing any data because of snafus.
#2. Remove the lava roll.
chkconfig lava off
chkconfig lavagui off
rollops --remove lava
#3. Add the LSFHPC roll.
rollops --add lsfhpc --iso lsfhpc-4.1.1-0.x86_64.disk1.iso
#4. Prep ENV and license info.
Edit /share/apps/scripts/cluster_bashrc (and cluster_cshrc)
Change this section to the appropriate lsf location:
# source the job scheduler environment; sh, ksh or bash
if [ -f /opt/lava/conf/profile.lsf ]; then
    . /opt/lava/conf/profile.lsf
fi
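After the edit, the block would presumably read as follows, assuming the roll installs its profile under /opt/lsfhpc (consistent with the install location mentioned in #1f):

```shell
# source the job scheduler environment; sh, ksh or bash
if [ -f /opt/lsfhpc/conf/profile.lsf ]; then
    . /opt/lsfhpc/conf/profile.lsf
fi
```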
Source that new environment.
Running which lmgrd should now return the lsfhpc version.
Next copy the license info to /opt/lsfhpc/conf/license.dat.
#5a. Start the license daemon … port 1700 is currently free.
su - flexlm -c "lmgrd -c /opt/lsfhpc/conf/license.dat -l /tmp/lsf_license.log"
#5b. Add this startup command to /etc/rc.local with the full path to lmgrd.
#5c. Check the license daemons:
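One way to check them; lmstat ships with FLEXlm alongside lmgrd, and the log path is the one lmgrd was started with above. Printed for review rather than executed here:

```shell
# FLEXlm status query: lists lmgrd and the vendor daemons serving the file.
check="lmstat -a -c /opt/lsfhpc/conf/license.dat"
echo "$check"
echo "tail /tmp/lsf_license.log"   # inspect the license daemon log as well
```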
#6. Assign compute nodes to additional resources.
insert into app_globals (service,component,values) values
('Info','DefaultLSFHostResource','mvapich lammpi mpichp4');
This will add the Infiniband MPI implementation.
<hi #ffff00>#7a.</hi> Re-image the io node.
⇒ Before you do this, redefine the io node as a compute appliance in the cluster database and turn rocks-grub on:
/etc/init.d/rocks-grub start
Then run /boot/kickstart/cluster-kickstart, or reseat the power cords.
Once done, mount all NFS file systems on the head node.
⇒ Redefine the io node as a “nas appliance” in the cluster database and:
chkconfig rocks-grub off
<hi #ffff00>#7b.</hi> Re-image all the compute nodes.
/boot/kickstart/cluster-kickstart, or use ipmitool and cycle the power while the nodes are running. This mimics an unclean shutdown, forcing a re-image.
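A sketch of the ipmitool variant; the BMC hostname scheme, user, and interface below are assumptions, so adjust for the real management network:

```shell
# Power-cycling over IPMI mimics an unclean shutdown, which triggers
# a re-image on the next boot. Printed for review; hostname is hypothetical.
node=compute-1-1
cmd="ipmitool -I lan -H ${node}-ipmi -U root chassis power cycle"
echo "$cmd"   # review, then loop over the compute nodes in staggered fashion
```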
<hi #ffff00> Add the memory modules at this time? </hi>
#8. Starting and testing the LSF HPC Cluster.
Steps #7a & #7b should add the nodes to the LSF cluster. Check the hosts file, because we have 2 NICs per node: make sure the nodes are registered as hostname.local, i.e. on the 192.168.1 subnet.
On head node:
chkconfig lsf on
After this is done, and all nodes are back up, walk through the lava configuration files and add any missing information to the equivalent LSF files.
On head node:
#9. Configure Master fail over.
Skip this step.
#10. “Go To 1”
Walk through the items in #1 and enable/reset functionality like cronjobs, tivoli, queues, ssh access …
Kick off Tivoli for an automated backup.
Test some job submissions …
Document the new MPI job submission procedure …
Add our eLIM after a while …
#11. Relocate some home directories.
#12. NAT box.
Reconfigure compute-1-1 for Scott, maybe.
So how long does this take?
— Meij, Henk 2007/11/20 09:42
The depts of CHEM and PHYS will each contribute $2,400 towards the purchase of additional memory. ITS will contribute $2,880. Thank you all! Since a 1 GB DIMM costs $60 and a 2 GB DIMM costs $120, we'll buy the latter only. A 4 GB DIMM costs $440, which is a substantial increase.
The $7,680 is enough to purchase 64 DIMMs adding 128 GB of memory to the cluster. Cluster wide then, the nodes will hold 320 GB of memory. The question is in what configuration? Here is a constraint … Link
|“Memory modules must be installed in pairs of matched memory size, speed, and technology, and the total number of memory modules in the configuration must total two, four, or eight. For best system performance, all four, or eight memory modules should be identical in size, speed, and technology. … System performance can be affected if your memory configuration does not conform to the preceding installation guidelines.”|
The 4 heavy weight nodes, with local dedicated fast disks, will not be changed. They currently contain an 8×2 DIMM configuration, thus 16 GB of memory each; all DIMM slots are filled. All 32 light weight nodes currently hold a 4×1 DIMM configuration, thus 4 GB of memory each.
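A quick sanity check of the figures above:

```shell
# Verify the budget and memory arithmetic from the text.
budget=$((2400 + 2400 + 2880))    # CHEM + PHYS + ITS contributions
dimms=$((budget / 120))           # 2 GB DIMMs at $120 each
added_gb=$((dimms * 2))           # memory added to the cluster
current_gb=$((32 * 4 + 4 * 16))   # 32 light weight + 4 heavy weight nodes
total_gb=$((current_gb + added_gb))
echo "$budget $dimms $added_gb $total_gb"   # 7680 64 128 320
```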
So the first suggestion is to remove the 1 GB DIMMs from the 16 gigE enabled nodes (queue
16-lwnodes) and add them to the Infiniband enabled nodes (queue
16-ilwnodes). That would make each Infiniband enabled node hold 8 GB of memory (8×1 configuration). It would fill their slots. A parallel job could access lots of memory across these nodes.
That then leaves 16 empty nodes and 64 2GB DIMMs to play with. What to do?
Here are some options.
|Scenario A||uniform, matches infiniband nodes|
|64<hi #ffff00>1</hi>||16<hi #ffff00>2</hi>||4×2<hi #ffff00>3</hi>||128<hi #ffff00>4</hi>||“ sixteen 8 GB medium weight nodes ”|
|Scenario B||add equal medium and heavy nodes|
|16||08||2×2||64||“ eight 4 GB light weight nodes ”|
|16||04||4×2||32||“ four 8 GB medium weight nodes ”|
|32||04||8×2||32||“ four 16 GB heavy weight nodes ”|
|Scenario C||emphasis on medium nodes|
|08||04||2×2||32||“ four 4 GB light weight nodes ”|
|40||10||4×2||80||“ ten 8 GB medium weight nodes ”|
|16||02||8×2||16||“ two 16 GB heavy weight nodes ”|
<hi #ffff00>1</hi> Number of DIMMs. This must total 64 within each scenario.
<hi #ffff00>2</hi> Number of nodes. This must total 16 within each scenario.
<hi #ffff00>3</hi> Memory pairs, one of these combinations: 2×2, 4×2, or 8×2.
<hi #ffff00>4</hi> Number of cores. This must total 128 within each scenario.
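The rows can be cross-checked against footnotes 1, 2 and 4:

```shell
# Each scenario must sum to 64 DIMMs, 16 nodes and 128 cores.
check() { echo "$1: $(($2)) dimms, $(($3)) nodes, $(($4)) cores"; }
check A "64"       "16"      "128"
check B "16+16+32" "8+4+4"   "64+32+32"
check C "8+40+16"  "4+10+2"  "32+80+16"
```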
Actually, the perfect argument for B was offered by Francis:
|If machines have 8 GB of RAM, 1 job locks up the node. So two jobs lock up 2 nodes, rendering a total of 14 cores unused and unavailable. Suppose instead we have 16GB machines. Two jobs would lock up just one machine, leaving only 6 cores unused and unavailable. This would seem to make better use of resources.|
In Scenario A above nothing really changes but the concept of a “light weight” node. It now would be an 8 GB memory footprint node versus the old value of 4 GB. So, no queues need to be renamed. One exception to that: the
04-hwnodes and the
gaussian queue comprise the same number of hosts, the heavy weight nodes. Perhaps we should remove the
gaussian queue. Gaussian jobs can be run on any node, light or heavy weight, Infiniband or gigE enabled.
In Scenarios B & C, things change: now we have light, medium and heavy weight nodes. One naming convention we could adopt:
|queue_name||=||number of nodes||+||which switch||+||GB mem per node||+||total cores||+||additional info||;|
Then our queues could be named like so:
|16i08g128c||16 nodes, infiniband enabled, each 8 GB mem (medium), comprising 128 cores total|
|08e04g064c||08 nodes, gigE enabled, each 4 GB mem (light), comprising 64 cores total|
|04e08g032c||04 nodes, gigE enabled, each 8 GB mem (medium), comprising 32 cores total|
|04e16g032c||04 nodes, gigE enabled, each 16 GB mem (heavy), comprising 32 cores total|
|04e16g032cfd||04 nodes, gigE enabled, each 16 GB mem (heavy), comprising 32 cores total||fast local disk access|
Or is this too cumbersome? Maybe.
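For what it's worth, the convention is mechanical enough to generate; the helper function below is mine, not part of any existing tooling:

```shell
# Build a queue name from: node count, switch (e or i), GB per node,
# total cores, and an optional suffix, per the convention above.
queue_name() {
    printf '%02d%s%02dg%03dc%s\n' "$1" "$2" "$3" "$4" "${5:-}"
}
queue_name 16 i 8 128      # -> 16i08g128c
queue_name  4 e 16 32 fd   # -> 04e16g032cfd
```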
Perhaps just an abbreviation:
|imw||16 nodes, infiniband enabled, each 8 GB mem (medium), comprising 128 cores total|
|elw||08 nodes, gigE enabled, each 4 GB mem (light), comprising 64 cores total|
|emw||04 nodes, gigE enabled, each 8 GB mem (medium), comprising 32 cores total|
|ehw||04 nodes, gigE enabled, each 16 GB mem (heavy), comprising 32 cores total|
|ehwfd||04 nodes, gigE enabled, each 16 GB mem (heavy), comprising 32 cores total||fast local disk access|
|NEW QUEUES all priority = 50|
|imw||compute-1-1 … compute-1-16|
|elw||compute-1-17 … compute-1-24|
|emw||compute-1-25 … compute-1-27 compute-2-28|
|ehw||compute-2-29 … compute-2-32|
|ehwfd||nfs-2-1 … nfs-2-4|
|matlab||imw + emw|
delete queues: idle, [i]debug, molscat, gaussian, nat-test