⇒ Platform/OCS's very good Running Jobs with Platform Lava (read it).
⇒ In all the examples below, man command will provide you with detailed information, for example man bsub.
Non-Infiniband! For Infiniband submissions go to Internal Link
This write-up focuses only on how to submit jobs using scripts, that is, in batch mode. There is an interactive mode, but if you create a script you have a record of how you submitted your job.
So I'm creating two bash shell scripts (they must be bash shells!). The first, myscript, sets up the environment and resources needed; the second, myjob, contains the actual program I want to run and any shell actions needed.
myscript
#!/bin/bash
# queue
#BSUB -q idle
# email me (##SUB) or save in $HOME (#SUB)
##BSUB -o outfile.email # standard out
#BSUB -e outfile.err # standard error
# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH
# run my job
./myjob one-arg two-arg
# label my job
#BSUB -J myLittleJob
The convention '#BSUB -parameter value' passes command line arguments to bsub … see man bsub for more information. If you wish to disable a directive, add another pound sign, as in '##BSUB …', and it will be treated as a comment. So in the example above the standard output will be sent to me via email (the default behavior), but standard error output (which could be rather large) is written to a file in my home directory when the job finishes.
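As a quick sanity check before submitting, one could list which directives in a script are active and which are commented out. The sketch below writes a sample that mirrors myscript to a temporary file and greps it; the temporary file and the variable names are illustrative assumptions. It only inspects the text and does not ask bsub how it will interpret the script.

```shell
#!/bin/bash
# Sample script (mirrors myscript) written to a temporary file for the demo.
sample=$(mktemp)
cat > "$sample" <<'EOF'
#!/bin/bash
#BSUB -q idle
##BSUB -o outfile.email
#BSUB -e outfile.err
./myjob one-arg two-arg
#BSUB -J myLittleJob
EOF

# A single pound sign directly followed by BSUB marks an active directive.
active=$(grep '^#BSUB' "$sample")
active_count=$(grep -c '^#BSUB' "$sample")

# Two pound signs turn the line into an ordinary comment.
disabled=$(grep '^##BSUB' "$sample")

echo "active directives ($active_count):"
echo "$active"
echo "disabled directives:"
echo "$disabled"

rm -f "$sample"
```

Pointing the two grep commands at your own script instead of the sample gives the same overview.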
Other than that, ENV variables are made available to myjob, a queue is defined (actually unnecessary as idle is the default queue) and two command line arguments are passed to myjob. Finally, a cute label is assigned.
myjob
#!/bin/bash
# pre_exec routine will create scratch dirs
# $MYSANSCRATCH in /sanscratch
# $MYLOCALSCRATCH in /localscratch
# in home directory
for i in `seq 1 25`; do
  d=`date`
  echo "$i $HOSTNAME $2 $1 $d" >> $MYLOCALSCRATCH/outfile
done
# retrieve some results
tail $MYLOCALSCRATCH/outfile > $MYSANSCRATCH/outfile2
cp $MYSANSCRATCH/outfile2 ./outfile3.$LSB_JOBID
echo DONE ... these dirs will be removed via post_exec
echo $MYSANSCRATCH $MYLOCALSCRATCH
OK, so my program grabs the date and appends it, along with the command line arguments, to a file in the MYLOCALSCRATCH directory, 25 times over. Then it grabs the last 10 lines and writes them to a file in the MYSANSCRATCH directory. Just for fun. Finally we copy that file to our home directory for keeps. Then we echo 'DONE' to standard out. Marvelous.
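To try the core logic of myjob without going through the scheduler, a dry run along the following lines may help. The fallback values for LSB_JOBID and the two scratch directories are assumptions made purely for local testing; in a real job the scheduler sets LSB_JOBID and the pre_exec routine creates the directories.

```shell
#!/bin/bash
# Dry-run sketch of myjob's logic outside the scheduler.
set -- one-arg two-arg                 # simulate the two command line arguments

# Assumed fallbacks, used only when the variables are not already set.
: "${LSB_JOBID:=dryrun.$$}"
scratch_root=${TMPDIR:-/tmp}
: "${MYLOCALSCRATCH:=$scratch_root/localscratch.$LSB_JOBID}"
: "${MYSANSCRATCH:=$scratch_root/sanscratch.$LSB_JOBID}"
mkdir -p "$MYLOCALSCRATCH" "$MYSANSCRATCH"

# Start from an empty file, then append 25 numbered lines as myjob does.
: > "$MYLOCALSCRATCH/outfile"
i=1
while [ "$i" -le 25 ]; do
    echo "$i ${HOSTNAME:-unknown} $2 $1 $(date)" >> "$MYLOCALSCRATCH/outfile"
    i=$((i + 1))
done

# tail prints the last 10 lines by default, so outfile2 holds lines 16 to 25.
tail "$MYLOCALSCRATCH/outfile" > "$MYSANSCRATCH/outfile2"
head -n 1 "$MYSANSCRATCH/outfile2"
```

The dry run leaves its directories in place so the results can be inspected; remove them by hand afterwards, since no post_exec routine runs here.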
Straightforward.
[hmeij@swallowtail ~]$ bsub < myscript
Job <1001> is submitted to queue <idle>.
[hmeij@swallowtail ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST    EXEC_HOST     JOB_NAME     SUBMIT_TIME
1001    hmeij   PEND  idle       swallowtail  -             myLittleJob  Apr 18 11:28
[hmeij@swallowtail ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST    EXEC_HOST     JOB_NAME     SUBMIT_TIME
1001    hmeij   RUN   idle       swallowtail  compute-1-14  myLittleJob  Apr 18 11:28
[hmeij@swallowtail ~]$ bjobs
No unfinished job found
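Rather than retyping bjobs by hand, the submit-and-poll cycle can be scripted. The sketch below is one possible approach, not part of the Lava tools: parse_jobid and wait_for_job are hypothetical helper names, the parsing assumes the "Job <N> is submitted to queue <Q>." message shown above, and the status is read from the third (STAT) column of bjobs output. Treating PEND, RUN and the three suspended states as "unfinished" is likewise an assumption.

```shell
#!/bin/bash
# Extract the job ID from bsub's confirmation message on stdin.
parse_jobid() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Poll bjobs until the job is no longer pending, running or suspended.
# Usage: wait_for_job JOBID [SECONDS_BETWEEN_POLLS]
# The body runs in a subshell so its variables stay local.
wait_for_job() (
    jobid=$1
    interval=${2:-10}
    while :; do
        # Pick the STAT column from the row whose first column is our job ID.
        stat=$(bjobs "$jobid" 2>/dev/null |
               awk -v id="$jobid" '$1 == id { print $3 }')
        case $stat in
            PEND|RUN|PSUSP|USUSP|SSUSP) sleep "$interval" ;;
            *) break ;;
        esac
    done
    # Print the last status seen, or GONE if bjobs no longer lists the job.
    echo "${stat:-GONE}"
)

# Example (requires a working scheduler, so it is left commented out):
#   jobid=$(bsub < myscript | parse_jobid)
#   wait_for_job "$jobid" 30
```

Keep the polling interval generous so the loop does not put needless load on the scheduler.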
<hi #ffff00>bjobs can also explain why your job is in PEND status …</hi>
[hmeij@swallowtail gaussian]$ bjobs -p 13892
JOBID   USER    STAT  QUEUE      FROM_HOST    EXEC_HOST     JOB_NAME     SUBMIT_TIME
13892   hmeij   PEND  gaussian   swallowtail  -             run101       Aug 29 16:23
 Queue's per-host job slot limit reached: 1 host;
You can query the scheduler regarding the status of your job.
[hmeij@swallowtail ~]$ bhist -l 1001
Job <1001>, Job Name <myLittleJob>, User <hmeij>, Project <default>, Command <#
                     !/bin/bash; # queue;#BSUB -q idle; # email me (##SUB) or s
                     ave in $HOME (#SUB);##BSUB -o outfile.email # standard oup
                     ut;#BSUB -e outfile.err # standard error; # unique job
                     scratch dirs;MYSANSCRATCH=/sanscratch/$LSB_JOBID;MYLOCALS>
Wed Apr 18 11:28:14: Submitted from host <swallowtail>, to Queue <idle>, CWD <$
                     HOME>, Error File <outfile.err>;
Wed Apr 18 11:28:20: Dispatched to <compute-1-14>;
Wed Apr 18 11:28:20: Starting (Pid 21569);
Wed Apr 18 11:28:25: Running with execution home </home/hmeij>, Execution CWD <
                     /home/hmeij>, Execution Pid <21569>;
Wed Apr 18 11:28:25: Done successfully. The CPU time used is 0.0 seconds;
Wed Apr 18 11:28:35: Post job process done successfully;
Summary of time in seconds spent in various states by Wed Apr 18 11:28:35
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  6        0        5        0        0        0        11
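If you only care about how long a job waited versus how long it ran, the summary table at the end of the bhist -l output can be picked apart with awk. This is a minimal sketch: bhist_summary is a hypothetical helper name, and it assumes the header row (PEND … TOTAL) is immediately followed by a row of values in the same column order, as in the output above.

```shell
#!/bin/bash
# Read bhist -l output on stdin and print pending, running and total seconds.
bhist_summary() {
    awk '$1 == "PEND" && $NF == "TOTAL" {
             # The values sit on the line right after the header row.
             if ((getline) > 0)
                 print "pend=" $1 " run=" $3 " total=" $NF
         }'
}

# Demo using the summary lines from the example job.
bhist_summary <<'EOF'
Summary of time in seconds spent in various states by Wed Apr 18 11:28:35
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  6        0        5        0        0        0        11
EOF
```

With a scheduler available, the same function could be fed directly, as in bhist -l 1001 | bhist_summary.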
The above job submission yields …
[hmeij@swallowtail hmeij]# ls -l
...
-rw-r--r-- 1 hmeij its 670 Apr 18 11:28 outfile3.1001
-rw-r--r-- 1 hmeij its   0 Apr 18 11:18 outfile.err
...
and the following email
Job <myLittleJob> was submitted from host <swallowtail> by user <hmeij>.
Job was executed on host(s) <compute-1-14.local>, in queue <idle>, as user <hmeij>.
</home/hmeij> was used as the home directory.
</home/hmeij> was used as the working directory.
Started at Wed Apr 18 11:28:20 2007
Results reported at Wed Apr 18 11:28:25 2007
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
#!/bin/bash
# queue
#BSUB -q idle
# email me (##SUB) or save in $HOME (#SUB)
##BSUB -o outfile.email # standard ouput
#BSUB -e outfile.err # standard error
# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH
# run my job
./myjob one-arg two-arg
# label my job
#BSUB -J myLittleJob
------------------------------------------------------------
Successfully completed.
Resource usage summary:
    CPU time      : 0.05 sec.
    Max Memory    : 4 MB
    Max Swap      : 117 MB
    Max Processes : 3
    Max Threads   : 3
The output (if any) follows:
DONE ... these dirs will be removed via post_exec
/sanscratch/1001 /localscratch/1001
PS:
Read file <outfile.err> for stderr output of this job.
Other bname utilities for managing your jobs …
bkill JOBID … stops your job
bstop JOBID … suspends your job
bresume JOBID … resumes your job
brequeue JOBID … stops your job and requeues it
brun -m HOSTNAME JOBID … forces your job to run (administrators only)
bswitch ALTERNATE_QUEUE JOBID … for pending and running jobs
bpeek JOBID … peek at your job output while it is running
⇒ bpeek shows you the tail of the job's standard output and standard error. Alternatively, you can follow the progress of your jobs in the directory ~/.lsbatch. For each job there will be a timestamp.jobpid.err, timestamp.jobpid.out and timestamp.jobpid.shell file. Do not remove or edit these files while your job is running.
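To follow the most recent job without looking up its file name, one could pick the newest .out file in ~/.lsbatch by modification time. The helper below is a sketch under that assumption; latest_job_out is a hypothetical name, and the function only reads the directory listing, so it stays within the "do not edit" rule.

```shell
#!/bin/bash
# Print the most recently modified *.out file in the given directory
# (default: ~/.lsbatch). Prints nothing if there are no such files.
# The body runs in a subshell so its variables stay local.
latest_job_out() (
    dir=${1:-$HOME/.lsbatch}
    # ls -t sorts newest first; head keeps only the first entry.
    ls -t "$dir"/*.out 2>/dev/null | head -n 1
)

# Example (read-only; left commented out because it needs a running job):
#   tail -f "$(latest_job_out)"
```

If several of your jobs run at once, the newest file may not belong to the job you have in mind, so check the jobpid part of the name.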