⇒ Lava, the scheduler, does not natively support parallel job submission, so a wrapper script is necessary. The wrapper obtains the allocated hosts from the LSB_HOSTS variable and builds the “machines” file. Follow the TEST link below for detailed information.
⇒ NCSA at UIUC offers a splendid course about MPI. If you're serious about MPI, take it; you can find a link to access this course here.
In all the examples below, the man command will provide you with detailed information (for example, man bsub).
Infiniband! For non-Infiniband jobs go to Internal Link
PLEASE READ THE 'ENV TEST' SECTION; IT EXPLAINS WHY THIS IS COMPLICATED.
Also, you need to test that your environment is set up correctly ⇒ ENV TEST ⇐
This write-up focuses only on how to submit jobs using scripts, meaning in batch mode. A single bash shell script (it must be a bash shell!) — here called myscript — is submitted to the scheduler.
<code bash>
#!/bin/bash

# queue
#BSUB -q idebug -n 16

# standard output: email me (##BSUB) or save to $HOME (#BSUB)
##BSUB -o outfile.email

# standard error
#BSUB -e outfile.err

# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH

# run my job
/share/apps/bin/mpich-mpirun -np 16 /share/apps/openmpi-1.2/bin/cpi

echo DONE ... these dirs will be removed via post_exec
echo $MYSANSCRATCH $MYLOCALSCRATCH

# label my job
#BSUB -J myLittleiJob
</code>
This looks much like the non-Infiniband job submission, but there are some key changes. First, I specify a queue whose nodes are connected to the Infiniband switch (idebug). We also request 16 processors. Queue idebug comprises the nodes ilogin/ilogin2, each with dual quad-core CPUs, so 2 nodes x 2 sockets x 4 cores = 16 cores — we will be using all of them.
The most significant change is that we call a 'wrapper' script: mpich-mpirun wraps the program mpirun (surprise). The reason is that the wrapper builds the machines file on the fly.
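What such a wrapper does can be sketched as follows. This is a minimal illustration only, not the actual mpich-mpirun code; the fallback host list is invented for demonstration:

```shell
#!/bin/bash
# Minimal sketch of a Lava/LSF wrapper: turn the LSB_HOSTS slot list
# into a machines file and hand it to mpirun.
# (Illustration only -- not the actual mpich-mpirun code; the fallback
# host list below is invented for demonstration.)

# LSB_HOSTS holds one hostname per allocated slot, e.g.
# "compute-1-15 compute-1-15 compute-1-16 compute-1-16"
HOSTS="${LSB_HOSTS:-compute-1-15 compute-1-15 compute-1-16 compute-1-16}"

MACHINES="machines.demo"          # per-job machines file (demo name)
: > "$MACHINES"                   # truncate/create
for h in $HOSTS; do
    echo "$h" >> "$MACHINES"      # one line per slot; duplicates expected
done

NP=$(wc -l < "$MACHINES")         # slot count = line count
# The real wrapper would now exec something like:
#   mpirun -np $NP -machinefile $MACHINES your_mpi_program
echo "np=$NP machinefile=$MACHINES"
```

With a four-slot allocation the machines file gets four lines, one per slot, and -np is derived from the line count.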
If you want to use the Local Area Multicomputer (LAM) MPI libraries, use the following wrapper script:
“make today an OpenMPI day”
<code>
[hmeij@swallowtail ~]$ bsub < imyscript
Job <1011> is submitted to queue <idebug>.
</code>
<code>
[hmeij@swallowtail ~]$ bjobs
JOBID   USER    STAT  QUEUE   FROM_HOST    EXEC_HOST  JOB_NAME      SUBMIT_TIME
1011    hmeij   PEND  idebug  swallowtail     -       myLittleiJob  Apr 19 14:54
</code>
<code>
[hmeij@swallowtail ~]$ bjobs
JOBID   USER    STAT  QUEUE   FROM_HOST    EXEC_HOST  JOB_NAME      SUBMIT_TIME
1011    hmeij   RUN   idebug  swallowtail  compute-1-16:compute-1-16:compute-1-16:compute-1-16:
                                           compute-1-16:compute-1-16:compute-1-16:compute-1-16:
                                           compute-1-15:compute-1-15:compute-1-15:compute-1-15:
                                           compute-1-15:compute-1-15:compute-1-15:compute-1-15
                                           myLittleiJob  Apr 19 14:54
</code>
<code>
[hmeij@swallowtail ~]$ bjobs
No unfinished job found
</code>
Note: as expected, 8 cores (see EXEC_HOST) were invoked on each node.
You can query the scheduler regarding the status of your job.
<code>
[hmeij@swallowtail ~]$ bhist -l 1011

Job <1011>, Job Name <myLittleiJob>, User <hmeij>, Project <default>, Command
     <#!/bin/bash; # queue;#BSUB -q idebug -n 16; # email me (##BSUB) or
     save in $HOME (#BSUB);##BSUB -o outfile.email # standard output;#BSUB
     -e outfile.err # standard error; # unique job scratch dirs;
     MYSANSCRATCH=/sanscratch/$LSB_JOBID>
Thu Apr 19 14:54:19: Submitted from host <swallowtail>, to Queue <idebug>,
     CWD <$HOME>, Error File <outfile.err>, 16 Processors Requested;
Thu Apr 19 14:54:24: Dispatched to 16 Hosts/Processors <compute-1-16>
     <compute-1-16> <compute-1-16> <compute-1-16> <compute-1-16>
     <compute-1-16> <compute-1-16> <compute-1-16> <compute-1-15>
     <compute-1-15> <compute-1-15> <compute-1-15> <compute-1-15>
     <compute-1-15> <compute-1-15> <compute-1-15>;
Thu Apr 19 14:54:24: Starting (Pid 6266);
Thu Apr 19 14:54:31: Running with execution home </home/hmeij>, Execution
     CWD </home/hmeij>, Execution Pid <6266>;
Thu Apr 19 14:55:47: Done successfully. The CPU time used is 0.0 seconds;
Thu Apr 19 14:55:57: Post job process done successfully;

Summary of time in seconds spent in various states by Thu Apr 19 14:55:57
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  5        0        83       0        0        0        88
</code>
The above job submission yields …
<code>
[hmeij@swallowtail ~]$ cat outfile.err
Process 11 on compute-1-16.local
Process 6 on compute-1-15.local
Process 14 on compute-1-16.local
Process 0 on compute-1-15.local
Process 1 on compute-1-15.local
Process 2 on compute-1-15.local
Process 3 on compute-1-15.local
Process 8 on compute-1-16.local
Process 4 on compute-1-15.local
Process 9 on compute-1-16.local
Process 5 on compute-1-15.local
Process 10 on compute-1-16.local
Process 7 on compute-1-15.local
Process 12 on compute-1-16.local
Process 13 on compute-1-16.local
Process 15 on compute-1-16.local
</code>
and the following email
<code>
Job <myLittleiJob> was submitted from host <swallowtail> by user <hmeij>.
Job was executed on host(s) <8*compute-1-16> <8*compute-1-15>, in queue <idebug>, as user <hmeij>.
</home/hmeij> was used as the home directory.
</home/hmeij> was used as the working directory.
Started at Thu Apr 19 14:54:24 2007
Results reported at Thu Apr 19 14:55:47 2007

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
#!/bin/bash
# queue
#BSUB -q idebug -n 16
# standard output: email me (##BSUB) or save to $HOME (#BSUB)
##BSUB -o outfile.email
# standard error
#BSUB -e outfile.err
# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH
# run my job
/share/apps/bin/mpich-mpirun -np 16 /share/apps/openmpi-1.2/bin/cpi
# label my job
#BSUB -J myLittleiJob
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time       :      0.05 sec.
    Max Memory     :         7 MB
    Max Swap       :       205 MB
    Max Processes  :         5
    Max Threads    :         5

The output (if any) follows:

pi is approximately 3.1416009869231245, Error is 0.0000083333333314
wall clock time = 0.312946
DONE ... these dirs will be removed via post_exec
/sanscratch/1011 /localscratch/1011

PS:

Read file <outfile.err> for stderr output of this job.
</code>
When I ran these OpenMPI invocations, I was also running an HPLinpack benchmark on the Infiniband nodes (to assess whether the nodes would still respond). Follow this link to read about the HPLinpack runs.
The idebug queue overrides the job slot limit set for each node (Max Job Slots = number of cores = 8): it allows QJOB_LIMIT=16 and UJOB_LIMIT=16. The benchmark was already running 8 jobs per node, and our job asks for 8 more per host. So basically, the hosts' job slots are exhausted, as is our user limit.
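To make the arithmetic explicit (all values taken from the description above), the per-host load during the run looks like this:

```shell
# Per-host load while the benchmark runs, using numbers from the text.
CORES_PER_NODE=$((2 * 4))        # dual quad-core CPUs -> 8 cores per node
BENCH_PER_NODE=8                 # HPLinpack processes already on each node
OURS_PER_NODE=8                  # this job adds 8 more per node
TOTAL_PER_NODE=$((BENCH_PER_NODE + OURS_PER_NODE))
echo "$TOTAL_PER_NODE processes on $CORES_PER_NODE cores per node"
```

That is 16 processes competing for 8 cores on each node, which is why the job slots (and the user limit) are exhausted.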
And so it was.
(important: I repeat this from another page — Henk Meij 2007/04/19 15:52)
Once you have your binary compiled, you can execute it on the head node or any other node with a hardcoded machines file. Like so (look at the code of this script):
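The script referenced above is not reproduced here, but the approach is roughly the following. The hostnames are assumptions, and the mpirun path mirrors the one used elsewhere on this page:

```shell
#!/bin/bash
# Rough sketch of running an MPI binary directly, outside the scheduler,
# with a hardcoded machines file. Hostnames are assumptions.
cat > machines.hardcoded <<'EOF'
compute-1-15
compute-1-16
EOF

# Launch directly, bypassing bsub (mpirun path as used elsewhere on this page):
#   /share/apps/openmpi-1.2/bin/mpirun -np 2 -machinefile machines.hardcoded ./cpi
echo "machines file has $(wc -l < machines.hardcoded) hosts"
```

This works interactively because you, not the scheduler, decide which hosts are in the machines file.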
This will not work when submitting your program to bsub. Platform reports:
<hi yellow> Lava (your scheduler) does not natively support parallel jobs, so you will have to write your own integration script that parses the hosts allocated by LSF (via the LSB_HOSTS variable) and passes them to your MPI distribution. </hi>
<hi orange> Also, because of the lack of LSF's parallel support daemons, these scripts can only provide loose integration with Lava. Specifically, Lava only knows about the mpirun process on the first host; it has no knowledge of the parallel processes on the other hosts involved in a parallel job. So if, in some circumstances, a parallel job fails, Lava cannot clean up the leftover processes (for example, MPICH 1's shared-memory leftovers). You may want to check your cluster regularly for this issue. </hi>
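One way to check for such leftovers is to scan each node for processes matching the job's binary name. A local-host sketch is below; the process name is an assumption, and on the real cluster you would wrap this in an ssh loop over the compute nodes:

```shell
#!/bin/bash
# Sketch: count processes matching an MPI binary's name, to spot
# leftovers after a failed parallel job. Runs on the local host only;
# on the cluster you would loop 'ssh <node> ...' over the compute nodes.
PATTERN="cpi"                                     # binary name (assumption)
LEFTOVERS=$(ps -ef | grep "$PATTERN" | grep -cv grep)
echo "$(hostname): $LEFTOVERS processes match '$PATTERN'"
```

Anything reported after a job has finished is a candidate for manual cleanup with kill on that node.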
And this makes the job submission process for parallel jobs tedious.