
⇒ Lava, the scheduler, does not natively support parallel job submission, so a wrapper script is necessary. The wrapper obtains the allocated hosts from the LSB_HOSTS variable and builds the “machines” file. Follow the TEST link below for detailed information.

⇒ NCSA at UIUC offers a splendid course about MPI. If you're serious about MPI, take it; a link to access the course can be found here.

⇒ For all the commands in the examples below, the man command provides detailed information, for example man bsub.

Jobs

Infiniband! For non-Infiniband jobs go to Internal Link

PLEASE READ THE 'ENV TEST' SECTION; IT EXPLAINS WHY THIS IS COMPLICATED.
Also, you need to test that your environment is set up correctly ⇒ ENV TEST

This write-up focuses only on how to submit jobs using scripts, that is, in batch mode. A single bash shell script (it must be a bash script!), imyscript below, is submitted to the scheduler.

imyscript

#!/bin/bash

# queue
#BSUB -q idebug -n 16

# email me (##BSUB = disabled) or save in $HOME (#BSUB = enabled)
##BSUB -o outfile.email # standard output
#BSUB  -e outfile.err   # standard error

# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH

# run my job
/share/apps/bin/mpich-mpirun -np 16 /share/apps/openmpi-1.2/bin/cpi

echo DONE ... these dirs will be removed via post_exec
echo $MYSANSCRATCH $MYLOCALSCRATCH

# label my job
#BSUB -J myLittleiJob
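A side note on the scratch variables above: the cpi example does not actually use them, but a job that stages data could, roughly as sketched below. This is only an illustration; the program and file names are hypothetical, and it assumes /sanscratch is visible from all nodes.

# stage input into the shared per-job scratch area, run there, copy results home
cd $MYSANSCRATCH
cp $HOME/mydata/input.dat .
/share/apps/bin/mpich-mpirun -np 16 $HOME/bin/my_mpi_program input.dat > output.dat
cp output.dat $HOME/mydata/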

This looks much like a non-Infiniband job submission, but there are some key changes. First, I specify a queue whose nodes are connected to the Infiniband switch (idebug). We also specify that we need 16 processors. Queue idebug consists of the nodes ilogin/ilogin2, each with two quad-core CPUs, so 2 nodes x 2 CPUs x 4 cores = 16 cores; we will be using all of them.

The most significant change is that we call a 'wrapper' script. This script, mpich-mpirun, wraps (surprise) the program mpirun. The reason for this is that the wrapper builds the machines file on the fly.
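A minimal sketch of what such a wrapper does is shown below. This is not the actual /share/apps/bin/mpich-mpirun, just an illustration of the idea: read the allocated hosts from LSB_HOSTS, write a machines file, and hand everything off to mpirun.

#!/bin/bash
# sketch of a Lava/LSF mpirun wrapper (the real mpich-mpirun may differ)

# LSB_HOSTS holds one hostname per allocated slot, e.g.
# "compute-1-16 compute-1-16 ... compute-1-15"
MACHINEFILE=/sanscratch/$LSB_JOBID/machines
mkdir -p /sanscratch/$LSB_JOBID

# one line per slot, in the order the scheduler handed them out
for host in $LSB_HOSTS; do
    echo $host
done > $MACHINEFILE

# number of slots = number of words in LSB_HOSTS
NP=$(echo $LSB_HOSTS | wc -w)

# call the real mpirun with the machines file we just built
exec /share/apps/openmpi-1.2/bin/mpirun -np $NP -machinefile $MACHINEFILE "$@"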

If you want to use the Local Area Multicomputer (LAM) MPI libraries, use the following wrapper script instead: /share/apps/bin/mpich-mpirun.gnulam
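The only change in the job script would be the run line, assuming your binary was compiled against the LAM libraries (the program path below is hypothetical):

# run my LAM-compiled job
/share/apps/bin/mpich-mpirun.gnulam -np 16 $HOME/bin/my_lam_program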

“make today an OpenMPI day”

bsub and bjobs

Straightforward.

[hmeij@swallowtail ~]$ bsub < imyscript
Job <1011> is submitted to queue <idebug>.
[hmeij@swallowtail ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1011    hmeij   PEND  idebug     swallowtail    -        myLittleiJob Apr 19 14:54
[hmeij@swallowtail ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1011    hmeij   RUN   idebug     swallowtail\
  compute-1-16:compute-1-16:compute-1-16:compute-1-16:\
  compute-1-16:compute-1-16:compute-1-16:compute-1-16:\
  compute-1-15:compute-1-15:compute-1-15:compute-1-15:\
  compute-1-15:compute-1-15:compute-1-15:compute-1-15 myLittleiJob Apr 19 14:54
[hmeij@swallowtail ~]$ bjobs
No unfinished job found

Note: as expected, 8 cores (see EXEC_HOST) were allocated on each node.

bhist

You can query the scheduler regarding the status of your job.

[hmeij@swallowtail ~]$ bhist -l 1011

Job <1011>, Job Name <myLittleiJob>, User <hmeij>, Project <default>, Command <
                     #!/bin/bash; # queue;#BSUB -q idebug -n 16; # email me (##
                     SUB) or save in $HOME (#SUB);##BSUB -o outfile.email # sta
                     ndard ouput;#BSUB  -e outfile.err   # standard error; # un
                     ique job scratch dirs;MYSANSCRATCH=/sanscratch/$LSB_JOBID>

Thu Apr 19 14:54:19: Submitted from host <swallowtail>, to Queue <idebug>, CWD
                     <$HOME>, Error File <outfile.err>, 16 Processors Requested
                     ;
Thu Apr 19 14:54:24: Dispatched to 16 Hosts/Processors <compute-1-16> <compute-
                     1-16> <compute-1-16> <compute-1-16> <compute-1-16> <comput
                     e-1-16> <compute-1-16> <compute-1-16> <compute-1-15> <comp
                     ute-1-15> <compute-1-15> <compute-1-15> <compute-1-15> <co
                     mpute-1-15> <compute-1-15> <compute-1-15>;
Thu Apr 19 14:54:24: Starting (Pid 6266);
Thu Apr 19 14:54:31: Running with execution home </home/hmeij>, Execution CWD <
                     /home/hmeij>, Execution Pid <6266>;
Thu Apr 19 14:55:47: Done successfully. The CPU time used is 0.0 seconds;
Thu Apr 19 14:55:57: Post job process done successfully;

Summary of time in seconds spent in various states by  Thu Apr 19 14:55:57
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  5        0        83       0        0        0        88

Job Output

The above job submission yields …

[hmeij@swallowtail ~]$ cat outfile.err 
Process 11 on compute-1-16.local
Process 6 on compute-1-15.local
Process 14 on compute-1-16.local
Process 0 on compute-1-15.local
Process 1 on compute-1-15.local
Process 2 on compute-1-15.local
Process 3 on compute-1-15.local
Process 8 on compute-1-16.local
Process 4 on compute-1-15.local
Process 9 on compute-1-16.local
Process 5 on compute-1-15.local
Process 10 on compute-1-16.local
Process 7 on compute-1-15.local
Process 12 on compute-1-16.local
Process 13 on compute-1-16.local
Process 15 on compute-1-16.local

and the following email

Job <myLittleiJob> was submitted from host <swallowtail> by user <hmeij>.
Job was executed on host(s) <8*compute-1-16>, in queue <idebug>, as user <hmeij>.
                            <8*compute-1-15>
</home/hmeij> was used as the home directory.
</home/hmeij> was used as the working directory.
Started at Thu Apr 19 14:54:24 2007
Results reported at Thu Apr 19 14:55:47 2007

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
#!/bin/bash

# queue
#BSUB -q idebug -n 16

# email me (##SUB) or save in $HOME (#SUB)
##BSUB -o outfile.email # standard ouput
#BSUB  -e outfile.err   # standard error

# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH

# run my job
/share/apps/bin/mpich-mpirun -np 16 /share/apps/openmpi-1.2/bin/cpi

# label my job
#BSUB -J myLittleiJob


------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :      0.05 sec.
    Max Memory :         7 MB
    Max Swap   :       205 MB

    Max Processes  :         5
    Max Threads    :         5

The output (if any) follows:

pi is approximately 3.1416009869231245, Error is 0.0000083333333314
wall clock time = 0.312946
DONE ... these dirs will be removed via post_exec
/sanscratch/1011 /localscratch/1011

PS:

Read file <outfile.err> for stderr output of this job.
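The post_exec mentioned in the output is a cluster-side hook that runs after the job finishes; presumably it does little more than the following (a sketch, not the actual script):

#!/bin/bash
# remove the per-job scratch directories created for this job
rm -rf /sanscratch/$LSB_JOBID
rm -rf /localscratch/$LSB_JOBID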

Bingo

When I ran these OpenMPI invocations I was also running an HPLinpack benchmark on the Infiniband nodes (to assess whether the nodes would still respond). Follow this link to read about the HPLinpack runs.

The idebug queue overrides the job slot limit set for each node (Max Job Slots = number of cores ⇒ 8). It allows QJOB_LIMIT=16 and UJOB_LIMIT=16. The benchmark was already running 8 jobs per node, and our job asks for 8 more per host. So, basically, the hosts' job slots are exhausted, as is our user limit.
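For reference, the relevant part of the idebug queue definition in lsb.queues would look roughly like this (a sketch; the actual configuration may differ):

Begin Queue
QUEUE_NAME   = idebug
QJOB_LIMIT   = 16    # job slot limit for the whole queue
UJOB_LIMIT   = 16    # job slot limit per user
HOSTS        = compute-1-15 compute-1-16
End Queue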

Cute

And so it was.

The Problem

(important, so I repeat this from another page — Henk Meij 2007/04/19 15:52)

Once you have compiled your binary, you can execute it on the head node or any other node with a hardcoded machines file. Like so (look at the code of this script):

[hmeij@swallowtail ~]$ /share/apps/bin/cpi.run
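That script is not reproduced here, but running an MPI binary outside the scheduler essentially boils down to the sketch below (the hardcoded machines file and slot counts are examples; the real cpi.run may differ):

#!/bin/bash
# run cpi on 16 slots using a hardcoded machines file
cat > /tmp/machines <<EOF
compute-1-15 slots=8
compute-1-16 slots=8
EOF
/share/apps/openmpi-1.2/bin/mpirun -np 16 -machinefile /tmp/machines \
    /share/apps/openmpi-1.2/bin/cpi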

This will not work when submitting your program to bsub. Platform reports:

<hi yellow> Lava (your scheduler) is not natively capable of running parallel jobs, so you will have to write your own integration script to parse the hosts allocated by LSF (via the LSB_HOSTS variable) and pass them to your MPI distribution. </hi>

<hi orange> Also, because of the lack of LSF's parallel support daemons, these scripts can only provide a loose integration with Lava. Specifically, Lava only knows about the mpirun process on the first host; it has no knowledge of the other parallel processes on the other hosts involved in a parallel job. So if, in some circumstances, a parallel job fails, Lava cannot clean up the leftover processes, for example, MPICH 1's shared-memory leftovers. You may want to regularly check your cluster for this issue. </hi>

And this makes the job submission process for parallel jobs tedious.
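Regarding the cleanup caveat above: a quick, crude way to look for orphaned MPI processes on a node after a failed parallel job is something like the following (assumes you can ssh to the compute nodes; the node name and process names are examples):

ssh compute-1-16 'ps -ef | egrep "mpirun|cpi" | grep -v grep'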

