\\
**[[cluster:28|Back]]**
=> Lava, the scheduler, is not natively capable of parallel job submission, so a wrapper script is necessary. The wrapper obtains the hosts from the LSB_HOSTS variable and builds the "machines" file. Follow the **TEST** link below for detailed information.
=> There is a splendid course about MPI offered by NCSA at UIUC. If you're serious about MPI, take it; you can find a link to access this course **[[cluster:28|here]]**.
=> In all the examples below, ''man //command//'' provides detailed information; for example, ''man bsub''.
===== Jobs =====
Infiniband! For non-Infiniband jobs, go to [[cluster:30|Internal Link]].
PLEASE READ THE 'ENV TEST' SECTION; IT EXPLAINS WHY THIS IS COMPLICATED. \\
Also, you need to test that your environment is set up correctly => **[[cluster:31|ENV TEST]]** <=
This write-up focuses only on how to submit jobs using scripts, meaning in batch mode. A single bash shell script (it must be a bash script!), here called ''imyscript'', is submitted to the scheduler.
**imyscript**
<code bash>
#!/bin/bash
# queue
#BSUB -q idebug -n 16
# email me (##BSUB) or save in $HOME (#BSUB)
##BSUB -o outfile.email  # standard output
#BSUB -e outfile.err     # standard error
# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH
# run my job
/share/apps/bin/mpich-mpirun -np 16 /share/apps/openmpi-1.2/bin/cpi
echo DONE ... these dirs will be removed via post_exec
echo $MYSANSCRATCH $MYLOCALSCRATCH
# label my job
#BSUB -J myLittleiJob
</code>
This looks much like the non-Infiniband job submissions, but there are some key changes. First, I specify a queue whose nodes are connected to the Infiniband switch (idebug). We also specify that we need 16 processors. Queue idebug comprises the nodes ilogin/ilogin2, each with two quad-core CPUs, so 2x2x4=16 cores; we will be using all of them.
The most significant change is that we call a 'wrapper' script. This script, ''mpich-mpirun'', wraps (surprise) the program ''mpirun''. The reason for this is that the wrapper builds the ''machines'' file on the fly.
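The wrapper's contents are not reproduced here, but the idea is simple. A minimal sketch, assuming the wrapper just translates ''LSB_HOSTS'' into a machines file and delegates to ''mpirun'' (the mpirun path and file locations are assumptions for illustration):
<code bash>
#!/bin/bash
# sketch of a Lava/LSF-to-MPI integration wrapper; the real
# /share/apps/bin/mpich-mpirun may differ in detail.
# LSB_HOSTS holds the allocated hosts, one name per job slot,
# e.g. "compute-1-16 compute-1-16 ... compute-1-15"
MACHINES=/tmp/machines.$LSB_JOBID
rm -f $MACHINES
for host in $LSB_HOSTS; do
  echo $host >> $MACHINES
done
# hand the machines file plus all original arguments to the real mpirun
exec /share/apps/openmpi-1.2/bin/mpirun -machinefile $MACHINES "$@"
</code>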
If you want to use the [[http://www.lam-mpi.org/|Local Area Multicomputer (LAM)]] MPI libraries, use the following wrapper script instead: ''/share/apps/bin/mpich-mpirun.gnulam''
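Usage is the same as above; for example (the binary path is a hypothetical placeholder):
<code bash>
# identical submission script, but with the LAM wrapper
/share/apps/bin/mpich-mpirun.gnulam -np 16 /path/to/your/lam-compiled-binary
</code>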
"make today an [[http://www.open-mpi.org/|OpenMPI]] day"
===== bsub and bjobs =====
Straightforward.
<code>
[hmeij@swallowtail ~]$ bsub < imyscript
Job <1011> is submitted to queue <idebug>.

[hmeij@swallowtail ~]$ bjobs
JOBID USER  STAT QEUE    FROM_HOST   EXEC_HOST JOB_NAME     SUBMIT_TIME
1011  hmeij PEND idebug  swallowtail -         myLittleiJob Apr 19 14:54

[hmeij@swallowtail ~]$ bjobs
JOBID USER  STAT QUEUE   FROM_HOST   EXEC_HOST JOB_NAME     SUBMIT_TIME
1011  hmeij RUN  idebug  swallowtail \
compute-1-16:compute-1-16:compute-1-16:compute-1-16:\
compute-1-16:compute-1-16:compute-1-16:compute-1-16:\
compute-1-15:compute-1-15:compute-1-15:compute-1-15:\
compute-1-15:compute-1-15:compute-1-15:compute-1-15 myLittleiJob Apr 19 14:54

[hmeij@swallowtail ~]$ bjobs
No unfinished job found
</code>
Note: as expected, 8 cores were invoked on each node (see the EXEC_HOST column).
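For more detail on a single job, ''bjobs'' also takes a long-format option; see ''man bjobs'' for all options:
<code bash>
# long listing for our example job id: shows slots, hosts, submit parameters
bjobs -l 1011
</code>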
===== bhist =====
You can query the scheduler regarding the status of your job.
<code>
[hmeij@swallowtail ~]$ bhist -l 1011

Job <1011>, Job Name <myLittleiJob>, User <hmeij>, Project <default>, Command <
             #!/bin/bash; # queue;#BSUB -q idebug -n 16; # email me (##
             BSUB) or save in $HOME (#BSUB);##BSUB -o outfile.email # sta
             ndard output;#BSUB -e outfile.err # standard error; # un
             ique job scratch dirs;MYSANSCRATCH=/sanscratch/$LSB_JOBID>

Thu Apr 19 14:54:19: Submitted from host <swallowtail>, to Queue <idebug>, CWD
                     <$HOME>, Error File <outfile.err>, 16 Processors Requested;
Thu Apr 19 14:54:24: Dispatched to 16 Hosts/Processors <8*compute-1-16>
                     <8*compute-1-15>;
Thu Apr 19 14:54:24: Starting (Pid 6266);
Thu Apr 19 14:54:31: Running with execution home </home/hmeij>, Execution CWD
                     </home/hmeij>, Execution Pid <6266>;
Thu Apr 19 14:55:47: Done successfully. The CPU time used is 0.0 seconds;
Thu Apr 19 14:55:57: Post job process done successfully;

Summary of time in seconds spent in various states by Thu Apr 19 14:55:57
  PEND  PSUSP  RUN  USUSP  SSUSP  UNKWN  TOTAL
  5     0      83   0      0      0      88
</code>
===== Job Output =====
The above job submission yields ...
<code>
[hmeij@swallowtail ~]$ cat outfile.err
Process 11 on compute-1-16.local
Process 6 on compute-1-15.local
Process 14 on compute-1-16.local
Process 0 on compute-1-15.local
Process 1 on compute-1-15.local
Process 2 on compute-1-15.local
Process 3 on compute-1-15.local
Process 8 on compute-1-16.local
Process 4 on compute-1-15.local
Process 9 on compute-1-16.local
Process 5 on compute-1-15.local
Process 10 on compute-1-16.local
Process 7 on compute-1-15.local
Process 12 on compute-1-16.local
Process 13 on compute-1-16.local
Process 15 on compute-1-16.local
</code>
and the following email:
<code>
Job was submitted from host <swallowtail> by user <hmeij>.
Job was executed on host(s) <8*compute-1-16>, in queue <idebug>, as user <hmeij>.
                            <8*compute-1-15>
</home/hmeij> was used as the home directory.
</home/hmeij> was used as the working directory.
Started at Thu Apr 19 14:54:24 2007
Results reported at Thu Apr 19 14:55:47 2007

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
#!/bin/bash
# queue
#BSUB -q idebug -n 16
# email me (##BSUB) or save in $HOME (#BSUB)
##BSUB -o outfile.email  # standard output
#BSUB -e outfile.err     # standard error
# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH
# run my job
/share/apps/bin/mpich-mpirun -np 16 /share/apps/openmpi-1.2/bin/cpi
# label my job
#BSUB -J myLittleiJob
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time      :      0.05 sec.
    Max Memory    :         7 MB
    Max Swap      :       205 MB
    Max Processes :         5
    Max Threads   :         5

The output (if any) follows:

pi is approximately 3.1416009869231245, Error is 0.0000083333333314
wall clock time = 0.312946
DONE ... these dirs will be removed via post_exec
/sanscratch/1011 /localscratch/1011

PS:
Read file <outfile.err> for stderr output of this job.
</code>
===== Bingo =====
When I ran these OpenMPI invocations, I was also running an HPLinpack benchmark on the Infiniband nodes (to assess whether the nodes would still respond). **[[cluster:26|Follow this to read about the HPLinpack runs.]]**
The idebug queue overrides the job slots set for each node (Max Job Slots = # of cores => 8). It allows QJOB_LIMIT=16 and UJOB_LIMIT=16. The benchmark is already running 8 jobs per node, and our job asks for 8 more per host. So basically, each host's job slots are exhausted, as is our user limit.
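You can verify these limits yourself with ''bqueues'' (see ''man bqueues''); the long listing shows the queue's slot limits:
<code bash>
# show the full configuration of the idebug queue
bqueues -l idebug
</code>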
{{:cluster:cpi.gif|Cute}}
And so it was.\\
===== The Problem =====
(important: I repeat this from another page --- //[[hmeij@wesleyan.edu|Henk Meij]] 2007/04/19 15:52//)
Once you have compiled your binary, you can execute it on the head node or any other node with a hardcoded ''machines'' file. Like so (look at the code of this script):
<code>
[hmeij@swallowtail ~]$ /share/apps/bin/cpi.run
</code>
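For illustration only, such a script might look like the sketch below; the actual contents of ''cpi.run'' are on the system itself, and the host names here are assumptions:
<code bash>
#!/bin/bash
# hypothetical sketch of a run with a hardcoded machines file
cat > /tmp/machines <<EOF
compute-1-15 slots=8
compute-1-16 slots=8
EOF
/share/apps/openmpi-1.2/bin/mpirun -np 16 -machinefile /tmp/machines \
  /share/apps/openmpi-1.2/bin/cpi
</code>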
This will not work when submitting your program to ''bsub''. Platform reports:
Lava (your scheduler) is not natively capable of running parallel jobs, so you will have to write your own integration script to parse the hosts allocated by LSF (via the LSB_HOSTS variable) and pass them to your MPI distribution.
Also, because of the lack of LSF's parallel support daemons, these scripts can only provide a loose integration with Lava. Specifically, Lava only knows about the mpirun process on the first host; it has no knowledge of the other parallel processes on the other hosts involved in a parallel job. So if, in some circumstances, a parallel job fails, Lava cannot clean up the leftover processes, for example, mpich 1's shared-memory leftovers. You may want to regularly check your cluster for this issue.
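A hedged example of such a manual check (the host name is illustrative, and the shared-memory segment id must come from the ''ipcs'' output):
<code bash>
# look for orphaned processes and shared-memory segments on a node
ssh compute-1-16 "ps -fu $USER; ipcs -m"
# remove a leftover shared-memory segment by the id ipcs reported
ssh compute-1-16 "ipcrm -m <shmid>"
</code>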
And this makes the job submission process for parallel jobs tedious.
\\
**[[cluster:28|Back]]**