DokuWiki

LSF for HPC

Platform Computing's flagship product is the Load Scheduler Facility for high performance computing: LSF for HPC.

Our cluster, driven by Platform's Open Cluster Stack, Platform/OCS, uses a derivative of the LSF product (called Lava) that is freely available. It is actually LSF v6.1 but i have since found out that such a statement is misleading. It is LSF v6.1 with a ton of functionality removed. Period.

So here is my list of arguments why we should upgrade. The right time is now, before our cluster is heavily used and busting at the seams.

Parallel Computing

Lava does not support a parallel computing environment.

We can run parallel jobs with Lava, but once submitted the Lava daemons have no knowledge of where what runs regarding the parallel job. It also can not assess if the user MPI daemons shut down properly. In order to run parallel jobs with Lava, we invoke a wrapper script, mpirun, that builds a “machinefile” on the fly of the hosts available for a particular job. That's it.

Another constraint is the lack of control of the individual parallel tasks. For example, Gaussian users would prefer to request 8 cores and have all those allocated on the same host. So, bsub -n 8 … with possibly another resource requirement of run only if more 12 Gb of memory is available; like so bsub -R “mem>12000” … . Although you can request the memory resource, Lava has no control over the individual parallel tasks. So if one of heavy weight nodes allocated to this job is running another job, Lava would assign 7 cores on hostA and 1 additional core on hostB to this job.

With the resource string bsub -n 8 -R “span[hosts=1] && mem>12000” … the user can request to run the job if all 8 cores can be allocated on the same host and use 12gb memory. Another useful option is bsub -X … which implies exclusive use of this host (ie schedule no other jobs while my job is running). This option, EXCLUSIVE=Y, is applied to the queue in the config files.

Another approach is to use the Infiniband device effectively by spreading the parallel tasks across multiple hosts. So for example bsub -n 16 -R “span=[ptile=2] && mem>500” … implies … run my job and allocated no more than 2 cores/host (requiring a total of 8 hosts) whom all have 500 mb of memory available on each host.

Lava's default, and only, behavior is to find idle hosts and fill the first host up, then fill the second host up, etc. This results in an uneven distribution of parallel tasks. It also hampers the ability of jobs to run that need a large number of job slots.

Another annoying quirk of Lava, which i hate to document, is that users needing to run parallel jobs need full ssh access to the compute nodes. This is needed so that the scheduler can start the daemons on behalf of the users. That's irritating. LSF v6+ gets around that with eiter lsfrun or blaunch.

Other options not available in Lava are:

PARALLEL_SCHED_BY_SLOT = Y

Implies that scheduling would use available job slots, not processors, to make a decision which hosts to use. Some sites “overload” their hosts which is something we can experiment with if we had LSF. Currently each host has 8 cores thus 8 job slots. But we could define each host to have 12 or 16 job slots. As long as the nodes do not swap heavily (read extend run time of jobs) this would increase job throughput.

RESOURCE_RESERVE_PER_SLOT = Y
SLOT_RESERVE = MAX_RESERVE_TIME[integer]

A parallel job that needs for example 32 job slots may reserve cores, that is flag them as used, during the time specified, for example 24 hours. If the pending job accumulates 32 slots, the job runs. If less that 32 slots are reserved and 24 hours passes, all flagged slots are released and a new reservation window starts. Not supported in Lava.

Another oddity. Becuase Lava is parallel blind, when the user issues a bkill it typically does not clean up all the processes. I've been assured that LSF does handle this correctly and seriously reduces the problem of orphaned processes.

Event Logging

LSB_LOCALDIR=/share/apps/logs/lsb.events

lsb.events is the crucial file containing the status of all jobs in the system that are not done. If this goes corrupt, LSF basically stops and has to start all over. Duplicate event logging guards against this with several checks and balances. Not supported in Lava.

"On Demand" Resources

With classroom use of the cluster anticipated, it will be impossible to dynamically allocated reosurces for classes when they need it. Not necessarily during class time, but suppose an assignment is due at a certain date. With LSF it is possible to set queues characteristics such as

PREEMPTION = PREEMPTIVE[queue_name]
FAIRSHARE = USER_SHARES[user_names, group_names]

USER_ADVANCED_RESERVATIONS = Y

Begin ResourceReservation
NAME        = dayPolicy
USERS       = user1 user2   
HOSTS       = hostA hostB 
TIME_WINDOW = 8:00-18:00     # weekly recurring reservation
End ResourceReservation

The combination of these settings in LSF will allow the cluster to operate on a mode of business as usual. But when certain jobs arrive at the queue they may preempt currently running jobs. Those “preemptable” jobs will be suspended until the priority jobs have finished and resume execution.

Another nice feature is advanced reservations in which certain users or groups have been granted the rights to schedule jobs during certain windows and be assured their jobs will run.

In a FairShare queue policy, all users are assured to have jobs run irregardless of how many are scheduled by whom or when the jobs arrive in the queue (as in, not First-Come-First-Serve). Each user is assign a priority value which is recalculated on each schedule interval. So a user with only 5 jobs is assured some jobs will run despite another users having 100 jobs in the queue.

Job Arrays

Not supported in Lava. Very handy. Think “shorthand” for job submissions.

bsub -J “myArray[1-1000]” -i “input.%I” -o “output.%i.%J” myJob

This single line submits 1,000 jobs to the queue in a single command. Input files are dynamically specified [%I iterates from 1 to 1000 with step 1], while the output is saved accordingly [%J is the JOBPID of the JobArray].

The jobs will be listed with a common JOBPID like for example 123[1],123[2],123[3]…123[1000] and can be acted upon by addressing the common or individual elements (for example suspend or remove jobs).

Backfill

BACKFILL = Y

Mentioned separatedly because it is such a clever feature in LSF. Queue level configuration option. If all jobs specify a wall clock runtime with bsub -W time … (possible), or the queue defines a default run time for all jobs (unlikely in our environment); then the scheduler is able to mix and match long and short running jobs by optimizing the scheduling events.

So if one job is anticipated to still need 2 hours of run time after having run 10 hours already, a short running job, with a specification for a run time of less than 2 hours, would immediately be scheduled for execution on the same host.

eLIM and eSUB

Note: i honestly don't know if this works in Lava yet.
I expect eLIM will not but eSUB will. Will try both anyway
surprise: eLIM works! Check this page — Henk Meij 2007/09/07 14:16

eLIM, external Load Information Monitor, is a perl or bash script that is a custom program. It extends the resources being monitored by LSF. For example, one of our uses would be to monitor the /sanscratch and /localscratch file systems. We'd write a program that every 30 seconds determines how much free space is available and reports that to the Load Information Monitor (LIM) running on each compute node. That information makes it to the Master Batch Scheduler (MBDSCHD). Our users would then be able to use that information in the resource string requests when submitting a job, like so:

bsub -R “sanscratch>300000 && mem>500” …

… run my job if 300 gb is available in /sanscratch (and reserve it!) and 500 mb of memory is available.

eSUB, external Submission, is a perl or bash script that is a custom program. It evaluates the submission string given to bsub and potentially acts on its contents. One use of this in our environment would be the following. If our Infiniband switch is loaded with non-parallel jobs causing parallel jobs to go into pending status waiting for resources, an eSUB program could intercept job submissions to the 16-ilwnodes queue. It would search for the number of job slots requested and queue name in the submission parameters. If job submissions to the 16-ilwnodes do not request more than 2 cores it is not a prarallel job, thus eSUB could rewrite the destination queue and reroute the submission to queue idle. Beautiful.

For Now

i rest.

But you should also know that LSF can manage heterogeous clusters and physically remote cluster installations. It can management requests such as move this job from this clusterA and clusterB if compatible, providing automatic checkpointing. Not that we would use such, but it's possible. Also, LSF clients could reside on remote desktops which extends the cluster to windows, macos etc.

Resource Reservation

Here is another we ran into. “rusage”, that is resource reservation, is the function which only exists inside LSF, not in Lava.

Arrgh!!

If you create a simple script with the following statement, you are basically requesting your job to run when that specific host has more than 500 MB of memory available. That is for scheduling purposes. Now assume that you know your job will fit inside 300MB of memory. In order to assure that your job keeps running while other jobs are also started on the same host which potentially could consume all memory before your job claims it, you can request to reserve that resource.

#BSUB -R "select[mem>500] rusage[mem=300]"
#BSUB -m nfs-2-2

There are many possibilities with rusage. You can also specify a ramp-up and ramp-down scenario, for example state you need 300MB form the start of your job for one hour, and after that only 100 MB.

bhosts -l host_name will report which resources are reserved on a per host basis. Like so.

[root@swallowtail ~]#: bhosts -l nfs-2-2

HOST  nfs-2-2
STATUS           CPUF  JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV DISPATCH_WINDOW
ok            240.00     -      8     7       7      0      0      0      -

 CURRENT LOAD USED FOR SCHEDULING:
              r15s   r1m  r15m    ut    pg    io   ls    it   tmp   swp   mem
 Total         0.8   1.1   1.1  100%   7.8   139    1  2658 7140M 1593M  418M
 Reserved      0.0   0.0   0.0    0%   0.0     0    0     0    0M    0M  300M
------------------------------------------------------------
...

Notice the line Reserved and the values. Before your job ran, 718 MB was available. But now only 418MB which is what the scheduler will now use when scheduling new jobs.

Back