
Webcast Demo of Platform/OCS by Platform Computing

William DeSalvo of Platform Computing gave a webcast presentation on Platform/OCS … the administrative software layer of our cluster design. Several documents detailing administrative aspects of the Platform/OCS software stack were obtained (see below).

Several interesting tidbits surfaced

* Lava (read: PBS) and LSF do recognize quad-core processors. Normally a “jobslot” is defined in the Lava or LSF configuration, one per processor. However, by increasing the number of jobslots, the scheduler can be made aware of the total number of cores available for scheduling, if desired.

* “esub” (aka “job submission filter”) is available for Lava. This script allows you to alter job parameters, including the queue. This implies that a “routing queue” could be set up. A routing queue allows users to submit jobs and define the needed parameters while letting the scheduler figure out which queue to use, so users do not have to worry about specifying one. It “routes” the job to the best available queue, either by looking for the best fit or by evaluating logic such as potential backfilling, prioritization based on user status, etc. It is a way to extend the Lava functionality if needed. LSF of course has all these functions built in.

* Preemption and reservation are not available in Lava (but are in LSF). Preemption allows low-priority jobs to be suspended when high-priority jobs are submitted. Reservation allows the scheduling of jobs that would otherwise be unlikely to run given the queue configurations. For example, the 32 lightweight nodes will probably be split into 16 on the Gigabit Ethernet switch and 16 on the InfiniBand switch. A job that requires all 32 lightweight nodes would be unlikely to be scheduled, given the other jobs being scheduled. Reservation makes it possible to define a period at some time in the future for the execution of such jobs, blocking other jobs/queues from being scheduled.

* Clumon, a monitoring tool like Ganglia, is built into Platform/OCS. For an example, view the 1,280-node Tungsten cluster at NCSA. It is an interesting monitor that gives a good overview of what the cluster is doing. Check it out.
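
To make the routing-queue idea from the esub item concrete, here is a minimal Python sketch of the "best fit" decision such a filter might make. It is an illustration under stated assumptions, not Lava or LSF code: the queue names, slot counts, and the `Queue`, `JobRequest`, and `route_job` names are all hypothetical, and how the chosen queue is handed back to the scheduler from within an esub script is not shown.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Queue:
    name: str          # hypothetical queue name
    max_slots: int     # jobslots this queue can offer (hypothetical numbers)
    interconnect: str  # "ethernet" or "infiniband"


@dataclass(frozen=True)
class JobRequest:
    slots: int                    # jobslots the user asked for
    needs_infiniband: bool = False
    queue: Optional[str] = None   # set only if the user named a queue


# Hypothetical queue layout; a real site would mirror its own configuration.
QUEUES = (
    Queue("small", max_slots=8, interconnect="ethernet"),
    Queue("gige", max_slots=64, interconnect="ethernet"),
    Queue("ib", max_slots=64, interconnect="infiniband"),
)


def route_job(job: JobRequest, queues: Sequence[Queue] = QUEUES) -> str:
    """Return the name of the queue the job should be submitted to."""
    # An explicit choice by the user is left untouched.
    if job.queue is not None:
        return job.queue

    # Keep only queues that can satisfy the request.
    candidates = [
        q for q in queues
        if q.max_slots >= job.slots
        and (not job.needs_infiniband or q.interconnect == "infiniband")
    ]
    if not candidates:
        raise ValueError(f"no queue can run a job needing {job.slots} slots")

    # Best fit: the smallest queue that still fits; ties go to list order.
    return min(candidates, key=lambda q: q.max_slots).name


if __name__ == "__main__":
    print(route_job(JobRequest(slots=4)))
    print(route_job(JobRequest(slots=4, needs_infiniband=True)))
```

With this layout, a 4-slot job lands in the hypothetical "small" queue, while the same job with `needs_infiniband=True` lands in "ib". Further rules, such as prioritization based on user status, would slot in as extra filters on the candidate list.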

Training is available and two courses are of interest

No dates are available for 2007, but plans are for an early April session. “In our experience, we have found it best when we had a few universities interested in training, and pulled them together for a formal onsite training. We are planning a course of this nature in early April” [Jenny Yam, Training Coordinator, Platform Computing]

Documentation

