Upgrading to LSF

Why? Here is my summary of some items I wish to take advantage of: Link

We're running Platform/OCS which includes the Lava scheduler. It's sorta like LSF but with functionality removed. However, it is free and very good. Our Dell cluster came pre-configured with Lava but it's time to leverage the resources of our cluster in more detail.

First Stumble

What version to upgrade to? Well, I thought that would be easy: the latest stable version, which is LSF v7.0.1. But that has complications. Our Platform/OCS cluster is a “rocks” based installation. For more reading: Link

Our OCS version is 4.1.1 and the only “roll” available for LSF/HPC is the 6.2 version. That's fine, because all the functionality I'm after exists in this version.

In order to install a v7 “roll”, we'd have to upgrade to OCS 4.4 and that means a huge disruption. All compute nodes would have to be re-imaged and re-deployed. Hence my preference to upgrade via the roll method to v6.2, as advised by Platform support.

Another option is to perform a manual install of v7 from source. However this would involve some significant work as the scheduler software is installed locally on each node and head node in the OCS environment. I'd like to keep it that way to reduce NFS traffic and not install the scheduler software on a shared filesystem.

Next Step

  • process a License Change request via Platform
  • obtain a new license file
  • download the LSF/HPC v6.2 roll at MyPlatform
  • plan the upgrade steps

Lots of Next Steps

#0 Shut off NAT box

Reset root password, shut off box.

#1a Close the head node to all ssh traffic (firewall to trusted user VLAN access only).

#1b Inactivate all queues, back up scratch dirs, stop all jobs.

  • Make friends! Tar up contents of /localscratch & /sanscratch before stopping jobs …
  • Take a snapshot of jobs running (like the clumon jobs page or with bjobs -u all) so users can view what was running … info is here External Link
  • Stop all jobs: bkill 0
  • Disable any cronjobs.
  • reset sknauert's home dir.
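The snapshot-then-stop order in the list above can be scripted. This is only a minimal sketch: it assumes bjobs and bkill are on the PATH after sourcing the scheduler environment, and the function name snapshot_jobs and the output path are hypothetical.

```shell
#!/bin/sh
# Save a listing of every user's jobs so it can be published before jobs are killed.
# snapshot_jobs and the snapshot path are illustrative names, not part of Lava or LSF.
snapshot_jobs() {
    snapshot_file=$1
    if ! command -v bjobs >/dev/null 2>&1; then
        echo "bjobs not found; is the scheduler environment sourced?" >&2
        return 1
    fi
    # "bjobs -u all" is the command named in the list above
    bjobs -u all > "$snapshot_file" 2>&1
}

# Usage on the head node (left commented out so the sketch has no side effects):
# snapshot_jobs /tmp/jobs_before_upgrade.txt && bkill 0
```

Chaining with && means bkill 0 only runs once the snapshot has been written.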

#1c Back up all files needed to rebuild the io-node.

The io-node is currently a compute node (but not a member of any queue and admin_closed). It has fiber channel (2 cards) to the Netapp storage device. A re-image of this node implies rebuilding the NFS environment and reconfiguring dm-multipath, so the entire cluster needs to go idle for that. Let's rebuild so we might use it if needed as a compute node or backup/failover head node. Rebuilding is documented on the

#1d Back up all files needed to rebuild the compute nodes.

This includes two varieties of nodes: a light weight node and a heavy weight node sample. Some of this should be customized with extend-compute.xml (minor changes for now …). Rebuilding is documented on the

#1e Stop the lava system across the cluster.

/etc/init.d/lava stop

  • also on ionode!!
  • also on the head node!!
  • also on the head node run /etc/init.d/lavagui stop

#1f Back up all files in /opt/lava.

⇒ copy the /opt/lava/work/lava/logdir/lsb.acct* files to the archive.

LSF/HPC will install in /opt/lsfhpc but make sure you have a remote backup copy of /opt/lava … rsync to /mnt/cluster/opt_lava_pre_lsf.

→ Disable Tivoli agents and start a manual incremental backup.
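The /opt/lava copy in this step could look roughly like the sketch below. The helper name backup_tree is hypothetical, and the cp fallback is an assumption for hosts without rsync; only the two paths in the usage line come from the text above.

```shell
#!/bin/sh
# backup_tree is a hypothetical helper: copy a directory tree to a backup location.
backup_tree() {
    bt_src=$1
    bt_dest=$2
    mkdir -p "$bt_dest" || return 1
    if command -v rsync >/dev/null 2>&1; then
        # -a preserves permissions, ownership, timestamps and symlinks
        rsync -a "$bt_src"/ "$bt_dest"/
    else
        # fallback when rsync is not installed
        cp -Rp "$bt_src"/. "$bt_dest"/
    fi
}

# Usage with the paths from this step (commented out so nothing is copied by accident):
# backup_tree /opt/lava /mnt/cluster/opt_lava_pre_lsf
```

Because the whole tree is copied, the lsb.acct* accounting files under work/lava/logdir come along as well.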

#1g Unmount all io-node exported file systems, leave nodes running.

We'll force a reboot followed by a re-image later in staggered fashion after we are done with the LSF install. But first unmount the NFS file systems with cluster-fork on all compute nodes & head node.

#1h Good time to clean all orphaned jobs' working dirs in /localscratch & /sanscratch!

→ fix: set this LUN to space reservation enabled (1TB)

#1i Unmount all multipathed LUN filesystems on io-node (/dev/mapper).


Node re-imaging involves formatting and partitioning. We do not want to risk losing any data because of snafus.

#2. Remove the lava roll.

chkconfig lava off
chkconfig lavagui off
rollops --remove lava

#3. Add the LSFHPC roll.

rollops --add lsfhpc --iso lsfhpc-4.1.1-0.x86_64.disk1.iso

#4. Prep ENV and license info.

Edit /share/apps/scripts/cluster_bashrc (and cluster_cshrc)
Change this section to the appropriate lsf location:

# source the job scheduler environment; sh, ksh or bash
if [ -f /opt/lava/conf/profile.lsf ]; then
        . /opt/lava/conf/profile.lsf
fi

Source that new environment. The command "which lmgrd" should now return the lsfhpc version.
Next copy the license info to /opt/lsfhpc/conf/license.dat.
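After the edit, the section in cluster_bashrc might look something like the sketch below. The path /opt/lsfhpc/conf/profile.lsf is an assumption based on the install location mentioned earlier, so check it against the actual roll. The function name source_scheduler_profile is hypothetical and exists only to make a missing profile visible.

```shell
#!/bin/sh
# source_scheduler_profile is a hypothetical wrapper around the usual
# "if [ -f ... ]; then . ...; fi" idiom. It warns when the profile is missing.
source_scheduler_profile() {
    scheduler_profile=$1
    if [ -f "$scheduler_profile" ]; then
        . "$scheduler_profile"
    else
        echo "scheduler profile not found: $scheduler_profile" >&2
        return 1
    fi
}

# Assumed new location; the old one was /opt/lava/conf/profile.lsf.
# source_scheduler_profile /opt/lsfhpc/conf/profile.lsf
```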

#5a. Start the license daemon … port 1700 is currently free.

su - flexlm -c "lmgrd -c /opt/lsfhpc/conf/license.dat -l /tmp/lsf_license.log"

#5b. Add this startup command to /etc/rc.local with full path to lmgrd.

#5c. Check the license daemons: lmstat

#6. Assign compute nodes to additional resources.

insert into app_globals (service,component,values) values
('Info','DefaultLSFHostResource','mvapich lammpi mpichp4');

This will add the Infiniband MPI implementation.

#7a. Re-image the io node.

⇒ Before you do this, redefine the io node as a compute appliance in the cluster database and turn rocks-grub on (/etc/init.d/rocks-grub start).

/boot/kickstart/cluster-kickstart or reseat the power cords.

Once done, mount all NFS file systems on the head node.

⇒ Redefine the io node as a “nas appliance” in the cluster database and:

/etc/init.d/rocks-grub stop
chkconfig rocks-grub off

#7b. Re-image all the compute nodes.

/boot/kickstart/cluster-kickstart or use ipmitools and cycle the power while nodes are running. This mimics an unclean shutdown forcing a re-imaging.

Add the memory modules at this time?

#8. Starting and testing the LSF HPC Cluster.

7a & 7b should add the nodes to the LSF cluster. Check the hosts file because we have 2 NICs per node. Make sure the nodes are registered as hostname.local meaning the 192.168.1 subnet.

On head node:
chkconfig lsf on
/etc/init.d/lsf restart

After this is done, and all nodes are back up, walk through the lava configuration files and add any missing information to the LSF equivalent files.

On head node:
lsadmin reconfig
badmin mbdrestart
badmin reconfig

#9. Configure Master fail over.

Skip this step.

#10. “Go To 1”

Walk through the items in #1 and enable/reset functionality like cronjobs, tivoli, queues, ssh access …

Kick off Tivoli for an automated backup.

Test some job submissions …

Document the new MPI job submission procedure …

Add our eLIM after a while …

#11. Relocate some home directories.

  • “pusers” LUNs fstarr & bstewart … morph into “rusers?” LUNs
  • relocate lvargarslara, skong & ztan to another LUN

#12. NAT box.

Reconfigure compute-1-1 for Scott, maybe.

So how long does this take:

  • one morning to install LSF/HPC + rebuild the ionode
  • one afternoon to rebuild all other nodes (and deal with unexpected hardware problems)
  • one morning to open every node and remove/add memory sticks

Meij, Henk 2007/11/20 09:42

Adding Memory

The depts of CHEM and PHYS will each contribute $2,400 towards the purchase of additional memory. ITS will contribute $2,880. Thank you all! Since a 1 GB DIMM costs $60 and a 2 GB DIMM costs $120, we'll buy the latter only. A 4 GB DIMM costs $440 which is a substantial increase.

The $7,680 is enough to purchase 64 DIMMs adding 128 GB of memory to the cluster. Cluster wide then, the nodes will hold 320 GB of memory. The question is in what configuration? Here is a constraint … Link

“Memory modules must be installed in pairs of matched memory size, speed, and technology, and the total number of memory modules in the configuration must total two, four, or eight. For best system performance, all four, or eight memory modules should be identical in size, speed, and technology. … System performance can be affected if your memory configuration does not conform to the preceding installation guidelines.”

The 4 heavy weight nodes, with local dedicated fast disks, will not be changed. They currently contain a DIMM configuration of 8×2, thus 16 GB of memory each. All DIMM slots are filled. All 32 light weight nodes currently hold a 4×1 DIMM configuration, thus 4 GB of memory each.
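The budget and memory totals quoted above can be rechecked with a few lines of shell arithmetic. All numbers come from the text; the variable names are just for illustration.

```shell
#!/bin/sh
# Recompute the memory purchase from the figures given in the text.
budget=$((2400 + 2400 + 2880))        # CHEM + PHYS + ITS, in dollars
dimm_price=120                        # price of one 2 GB DIMM
dimm_gb=2

new_dimms=$((budget / dimm_price))    # how many 2 GB DIMMs the budget buys
added_gb=$((new_dimms * dimm_gb))     # memory added to the cluster
current_gb=$((4 * 16 + 32 * 4))       # 4 heavy nodes at 16 GB + 32 light nodes at 4 GB
total_gb=$((current_gb + added_gb))   # cluster-wide memory after the purchase

echo "budget=\$$budget dimms=$new_dimms added=${added_gb}GB total=${total_gb}GB"
# prints: budget=$7680 dimms=64 added=128GB total=320GB
```

For comparison, the 4 GB DIMM works out to $110 per GB against $60 per GB for the 2 GB DIMM.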

So the first suggestion is to remove the 1 GB DIMMs from the 16 gigE enabled nodes (queue 16-lwnodes) and add them to the Infiniband enabled nodes (queue 16-ilwnodes). That would make each Infiniband enabled node hold 8 GB of memory (8×1 configuration). It would fill their slots. A parallel job could access lots of memory across these nodes.

That then leaves 16 empty nodes and 64 2GB DIMMs to play with. What to do?
Here are some options.

DIMMs [1]  Nodes [2]  Config [3]  Cores [4]  Description

Scenario A: uniform, matches infiniband nodes
64         16         4×2         128        “sixteen 8 GB medium weight nodes”

Scenario B: add equal medium and heavy nodes
16         08         2×2         64         “eight 4 GB light weight nodes”
16         04         4×2         32         “four 8 GB medium weight nodes”
32         04         8×2         32         “four 16 GB heavy weight nodes”

Scenario C: emphasis on medium nodes
08         04         2×2         32         “four 4 GB light weight nodes”
40         10         4×2         80         “ten 8 GB medium weight nodes”
16         02         8×2         16         “two 16 GB heavy weight nodes”

Scenario D

[1] Number of DIMMs. This must total 64 within each scenario.
[2] Number of nodes. This must total 16 within each scenario.
[3] Memory pairs, one of these combinations: 2×2, 4×2, or 8×2.
[4] Number of cores. This must total 128 within each scenario.
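The three totals that every scenario must satisfy are easy to check mechanically, which helps when drafting a Scenario D. This is a small sketch: the function name scenario_totals is hypothetical, and the value of 8 cores per node is an assumption derived from 128 cores across 16 nodes.

```shell
#!/bin/sh
CORES_PER_NODE=8   # assumption: 128 cores / 16 nodes, as implied by the table

# Each argument is a node group written as nodes:dimms_per_node.
# Prints "total_dimms total_nodes total_cores".
scenario_totals() {
    st_dimms=0
    st_nodes=0
    for st_group in "$@"; do
        st_n=${st_group%%:*}       # number of nodes in the group
        st_d=${st_group##*:}       # 2 GB DIMMs per node (2, 4 or 8)
        st_nodes=$((st_nodes + st_n))
        st_dimms=$((st_dimms + st_n * st_d))
    done
    echo "$st_dimms $st_nodes $((st_nodes * CORES_PER_NODE))"
}

scenario_totals 16:4           # Scenario A, prints: 64 16 128
scenario_totals 8:2 4:4 4:8    # Scenario B, prints: 64 16 128
scenario_totals 4:2 10:4 2:8   # Scenario C, prints: 64 16 128
```

Node counts are passed without leading zeros (8, not 08) because the shell would otherwise try to read them as octal.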

  • Personally, I was initially leaning towards A.
  • But now, viewing this table, I like the distribution of cores across light, medium and heavy weight nodes in B.
  • C really depends on whether we need 8 GB nodes. Not sure why we would do this vs A.

Actually, the perfect argument for B was offered by Francis:

If machines have 8 GB of RAM, 1 job locks up the node. So two jobs lock up 2 nodes, rendering a total of 14 cores unused and unavailable. Suppose instead we have 16GB machines. Two jobs would lock up just one machine, leaving only 6 cores unused and unavailable. This would seem to make better use of resources.
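Francis's numbers can be reproduced with a short calculation. The sketch assumes each job runs on a single core and needs about 8 GB of memory, and that every node has 8 cores. None of this is stated outright in the argument, and the function name idle_cores is hypothetical.

```shell
#!/bin/sh
# idle_cores JOBS JOB_MEM_GB NODE_MEM_GB CORES_PER_NODE
# Prints how many cores sit idle on the nodes that the jobs occupy,
# assuming memory is the only limit and each job uses one core.
idle_cores() {
    ic_jobs=$1
    ic_job_mem=$2
    ic_node_mem=$3
    ic_cores=$4
    ic_per_node=$((ic_node_mem / ic_job_mem))    # jobs that fit in one node's memory
    if [ "$ic_per_node" -lt 1 ]; then
        echo "job does not fit on a node" >&2
        return 1
    fi
    # round up: nodes needed to hold all jobs
    ic_nodes=$(( (ic_jobs + ic_per_node - 1) / ic_per_node ))
    echo $((ic_nodes * ic_cores - ic_jobs))
}

idle_cores 2 8 8 8     # two jobs on 8 GB nodes, prints: 14
idle_cores 2 8 16 8    # two jobs on 16 GB nodes, prints: 6
```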

Renaming Queues

In Scenario A above nothing really changes but the concept of a “light weight” node. It now would be an 8 GB memory footprint node versus the old value of 4 GB. So, no queues need to be renamed. One exception to that: the 04-hwnodes and the gaussian queue comprise the same number of hosts, the heavy weight nodes. Perhaps we should remove the gaussian queue. Gaussian jobs can be run on any node, light or heavy weight, Infiniband or gigE enabled.

In Scenario B & C, things change. Now we have light, medium and heavy weight nodes. One naming convention we could adopt could be:

queue_name = number of nodes + which switch + GB mem per node + total cores + additional info ;

Then our queues could be named like so:

16i08g128c    16 nodes, infiniband enabled, each 8 GB mem (medium), comprising 128 cores total
08e04g064c    08 nodes, gigE enabled, each 4 GB mem (light), comprising 64 cores total
04e08g032c    04 nodes, gigE enabled, each 8 GB mem (medium), comprising 32 cores total
04e16g032c    04 nodes, gigE enabled, each 16 GB mem (heavy), comprising 32 cores total
04e16g032cfd  04 nodes, gigE enabled, each 16 GB mem (heavy), comprising 32 cores total, fast local disk access
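The naming convention is mechanical enough to generate. The sketch below uses a hypothetical function queue_name, assumes 8 cores per node to derive the total core count, and treats the trailing "additional info" field as an optional suffix.

```shell
#!/bin/sh
# queue_name NODES SWITCH MEM_GB CORES_PER_NODE [SUFFIX]
# SWITCH is "i" for Infiniband or "e" for gigE; SUFFIX is optional extra info.
queue_name() {
    qn_nodes=$1
    qn_switch=$2
    qn_mem=$3
    qn_cores_per_node=$4
    qn_suffix=${5:-}
    # zero-pad: 2 digits for nodes, 2 for GB per node, 3 for total cores
    printf '%02d%s%02dg%03dc%s\n' "$qn_nodes" "$qn_switch" "$qn_mem" \
        "$((qn_nodes * qn_cores_per_node))" "$qn_suffix"
}

queue_name 16 i 8 8       # prints: 16i08g128c
queue_name 8 e 4 8        # prints: 08e04g064c
queue_name 4 e 16 8 fd    # prints: 04e16g032cfd
```

Arguments are given without leading zeros, since printf would otherwise parse a value like 08 as invalid octal.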

Or is this too cumbersome? Maybe.
Perhaps just an abbreviation:

imw    16 nodes, infiniband enabled, each 8 GB mem (medium), comprising 128 cores total
elw    08 nodes, gigE enabled, each 4 GB mem (light), comprising 64 cores total
emw    04 nodes, gigE enabled, each 8 GB mem (medium), comprising 32 cores total
ehw    04 nodes, gigE enabled, each 16 GB mem (heavy), comprising 32 cores total
ehwfd  04 nodes, gigE enabled, each 16 GB mem (heavy), comprising 32 cores total, fast local disk access

NEW QUEUES (all priority = 50)

Queue   Hosts
imw     compute-1-1 … compute-1-16
elw     compute-1-17 … compute-1-24
emw     compute-1-25 … compute-1-27 compute-2-28
ehw     compute-2-29 … compute-2-32
ehwfd   nfs-2-1 … nfs-2-4
matlab  imw + emw
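
When these ranges are typed into the queue definitions, a small helper avoids writing out every host name. The sketch below uses a hypothetical function host_range that expands a prefix and a numeric range into a space-separated list, which is the form the HOSTS line of an lsb.queues entry is assumed to take here.

```shell
#!/bin/sh
# host_range PREFIX FIRST LAST
# Prints PREFIX-FIRST ... PREFIX-LAST separated by single spaces.
host_range() {
    hr_prefix=$1
    hr_i=$2
    hr_last=$3
    hr_list=
    while [ "$hr_i" -le "$hr_last" ]; do
        # add a separating space only when the list is not empty
        hr_list="${hr_list}${hr_list:+ }${hr_prefix}-${hr_i}"
        hr_i=$((hr_i + 1))
    done
    echo "$hr_list"
}

host_range compute-1 17 24    # the elw hosts
host_range nfs-2 1 4          # the ehwfd hosts, prints: nfs-2-1 nfs-2-2 nfs-2-3 nfs-2-4
```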

delete queues: idle, [i]debug, molscat, gaussian, nat-test


cluster/52.txt · Last modified: 2007/11/20 10:18 (external edit)