cluster:30


⇒ See Platform/OCS's very good guide, Running Jobs with Platform Lava (read it).

⇒ For every command in the examples below, man command gives you detailed information, for example man bsub.

Jobs

Non-Infiniband! For Infiniband submissions go to Internal Link

This write-up focuses only on submitting jobs with scripts, that is, in batch mode. There is an interactive mode, but if you create a script you have a record of how you submitted your job.

So I'm creating two bash shell scripts (they must be bash shells!). The first, myscript, sets up the environment and resources needed; the second, myjob, contains the actual program I want to run and any shell actions needed.

myscript

#!/bin/bash

# queue
#BSUB -q idle

# email me (##SUB) or save in $HOME (#SUB)
##BSUB -o outfile.email   # standard out
#BSUB  -e outfile.err     # standard error

# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH

# run my job
./myjob one-arg two-arg

# label my job
#BSUB -J myLittleJob

The convention '#BSUB -parameter value' passes command line arguments to bsub; see man bsub for more information. To disable a directive, add another pound sign, as in '##BSUB …', and it will be treated as a comment. So in the example above the standard output is sent to me via email (the default behavior), but standard error output (which could be rather large) is written to a file in my home directory when the job finishes.
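Before submitting, it can be handy to see which directives in a script are live and which are commented out. The following is only a rough sketch: the function names are made up for illustration, it assumes directives start in the first column (as in myscript), and it does not claim to reproduce how bsub itself parses the file.

```shell
#!/bin/bash
# Hypothetical helpers, not part of Lava.

# Print lines that begin with a single '#' followed by BSUB (active directives).
active_directives() {
    grep '^#BSUB' "$1"
}

# Print lines that begin with '##BSUB' (directives turned into comments).
disabled_directives() {
    grep '^##BSUB' "$1"
}
```

For example, active_directives myscript would list the -q, -e and -J lines, while disabled_directives myscript would list the commented-out -o line.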

Other than that, ENV variables are made available to myjob, a queue is defined (actually unnecessary as idle is the default queue) and two command line arguments are passed to myjob. Finally, a cute label is assigned.

myjob

#!/bin/bash

# pre_exec routine will create scratch dirs
# $MYSANSCRATCH   in /sanscratch
# $MYLOCALSCRATCH in /localscratch

# in home directory
for i in `seq 1 25`; 
do 
d=`date`; 
echo "$i $HOSTNAME $2 $1 $d" >> $MYLOCALSCRATCH/outfile; 
done

# retrieve some results
tail $MYLOCALSCRATCH/outfile > $MYSANSCRATCH/outfile2
cp $MYSANSCRATCH/outfile2 ./outfile3.$LSB_JOBID

echo DONE ... these dirs will be removed via post_exec
echo $MYSANSCRATCH $MYLOCALSCRATCH

OK, so my program grabs the date and appends it, with the command line arguments, to a file in the MYLOCALSCRATCH directory, 25 times over. Then it takes the last 10 lines (the default for tail) and writes them to a file in the MYSANSCRATCH directory. Just for fun. Finally we copy that file to our home directory for keeps. Then we echo 'DONE' to standard out. Marvelous.
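If you want to try the myjob logic before handing it to the scheduler, a dry run along these lines may help. It is a sketch under stated assumptions: outside Lava no pre_exec routine creates the scratch directories, so temporary directories stand in for them, and FAKEHOME is a made-up stand-in for the home directory so nothing real gets written.

```shell
#!/bin/bash
# Dry run of the myjob logic outside the scheduler (a sketch, not part of Lava).
# Assumption: LSB_JOBID and the scratch variables are unset here, so we
# substitute a label and temporary directories for them.
LSB_JOBID=${LSB_JOBID:-dryrun}
MYLOCALSCRATCH=${MYLOCALSCRATCH:-$(mktemp -d)}
MYSANSCRATCH=${MYSANSCRATCH:-$(mktemp -d)}
FAKEHOME=$(mktemp -d)   # hypothetical stand-in for the home directory

# same loop as myjob, with the two arguments written out literally
for i in $(seq 1 25); do
    echo "$i ${HOSTNAME:-localhost} two-arg one-arg $(date)" >> "$MYLOCALSCRATCH/outfile"
done

# keep the last 10 lines, then copy them "home"
tail "$MYLOCALSCRATCH/outfile" > "$MYSANSCRATCH/outfile2"
cp "$MYSANSCRATCH/outfile2" "$FAKEHOME/outfile3.$LSB_JOBID"

echo "result in $FAKEHOME/outfile3.$LSB_JOBID"
```

The resulting file should hold the lines numbered 16 through 25, which is what the real job leaves behind in outfile3.JOBID.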

bsub and bjobs

Straightforward.

[hmeij@swallowtail ~]$ bsub < myscript
Job <1001> is submitted to queue <idle>.
[hmeij@swallowtail ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1001    hmeij   PEND  idle       swallowtail    -        myLittleJob Apr 18 11:28
[hmeij@swallowtail ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1001    hmeij   RUN   idle       swallowtail compute-1-14 myLittleJob Apr 18 11:28
[hmeij@swallowtail ~]$ bjobs
No unfinished job found

bjobs can also explain why your job is in PEND status …

[hmeij@swallowtail gaussian]$ bjobs -p 13892
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
13892   hmeij   PEND  gaussian   swallowtail    -        run101     Aug 29 16:23
 Queue's per-host job slot limit reached: 1 host;
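If you only care about the state of one job, the STAT column can be picked out of the bjobs listing with awk. This is a minimal sketch that assumes the column layout shown above (JOBID first, STAT third); job_stat is a made-up helper name, not a Lava command.

```shell
#!/bin/bash
# Hypothetical helper: read bjobs output on stdin and print the STAT
# column for the job whose JOBID matches the first argument.
job_stat() {
    awk -v id="$1" '$1 == id { print $3 }'
}
```

Something like bjobs | job_stat 1001 would then print PEND or RUN, and print nothing once the job is no longer listed.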

bhist

You can query the scheduler regarding the status of your job.

[hmeij@swallowtail ~]$ bhist -l 1001

Job <1001>, Job Name <myLittleJob>, User <hmeij>, Project <default>, Command <#
                     !/bin/bash; # queue;#BSUB -q idle; # email me (##SUB) or s
                     ave in $HOME (#SUB);##BSUB -o outfile.email # standard oup
                     ut;#BSUB  -e outfile.err   # standard error; # unique job
                     scratch dirs;MYSANSCRATCH=/sanscratch/$LSB_JOBID;MYLOCALS>

Wed Apr 18 11:28:14: Submitted from host <swallowtail>, to Queue <idle>, CWD <$
                     HOME>, Error File <outfile.err>;
Wed Apr 18 11:28:20: Dispatched to <compute-1-14>;
Wed Apr 18 11:28:20: Starting (Pid 21569);
Wed Apr 18 11:28:25: Running with execution home </home/hmeij>, Execution CWD <
                     /home/hmeij>, Execution Pid <21569>;
Wed Apr 18 11:28:25: Done successfully. The CPU time used is 0.0 seconds;
Wed Apr 18 11:28:35: Post job process done successfully;

Summary of time in seconds spent in various states by  Wed Apr 18 11:28:35
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  6        0        5        0        0        0        11
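The summary table at the bottom is easy to pull apart if you want the pending and run times in a script. A small sketch, assuming the header row looks exactly like the one above; state_times is a made-up name for illustration.

```shell
#!/bin/bash
# Hypothetical helper: read "bhist -l" output on stdin, find the
# PEND/PSUSP/RUN... header row, and report selected columns from the
# row of numbers that follows it.
state_times() {
    awk '$1 == "PEND" && $2 == "PSUSP" {
        getline                      # move to the row with the numbers
        print "pend=" $1, "run=" $3, "total=" $7
    }'
}
```

Used as bhist -l 1001 | state_times, this would report the seconds spent pending, running, and in total for the job above.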

Job Output

The above job submission yields …

[hmeij@swallowtail hmeij]# ls -l
...
-rw-r--r--  1 hmeij its  670 Apr 18 11:28 outfile3.1001
-rw-r--r--  1 hmeij its    0 Apr 18 11:18 outfile.err
...

and the following email

Job <myLittleJob> was submitted from host <swallowtail> by user <hmeij>.
Job was executed on host(s) <compute-1-14.local>, in queue <idle>, as user <hmeij>.
</home/hmeij> was used as the home directory.
</home/hmeij> was used as the working directory.
Started at Wed Apr 18 11:28:20 2007
Results reported at Wed Apr 18 11:28:25 2007

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
#!/bin/bash

# queue
#BSUB -q idle

# email me (##SUB) or save in $HOME (#SUB)
##BSUB -o outfile.email # standard ouput
#BSUB  -e outfile.err   # standard error

# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH

# run my job
./myjob one-arg two-arg

# label my job
#BSUB -J myLittleJob


------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :      0.05 sec.
    Max Memory :         4 MB
    Max Swap   :       117 MB

    Max Processes  :         3
    Max Threads    :         3

The output (if any) follows:

DONE ... these dirs will be removed via post_exec
/sanscratch/1001 /localscratch/1001


PS:

Read file <outfile.err> for stderr output of this job.

b'ees

Other bname utilities for managing your jobs …

bkill JOBID … stops your job

bstop JOBID … suspends your job

bresume JOBID … resumes your job

brequeue JOBID … stops your job and requeues it

brun -m HOSTNAME JOBID … force your job to run (administrators only)

bswitch ALTERNATE_QUEUE JOBID … for pending and running jobs

bpeek JOBID … peek at your job output while it is running

bpeek shows you the tail of standard output and standard error. Alternatively, you can follow the progress of your jobs in the directory ~/.lsbatch. For each job there will be a timestamp.jobpid.err, timestamp.jobpid.out and timestamp.jobpid.shell file. Do not remove or edit these files while your job is running.
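To find the output file of whatever job started writing most recently, something like the following could work. It is a sketch only: latest_job_out is a made-up helper, it relies on the .out naming described above, and it just reads the directory listing, so it does not touch the files themselves.

```shell
#!/bin/bash
# Hypothetical helper: print the most recently modified *.out file in a
# directory (default: ~/.lsbatch). Read-only; nothing is edited or removed.
latest_job_out() {
    dir=${1:-$HOME/.lsbatch}
    ls -t "$dir"/*.out 2>/dev/null | head -n 1
}
```

You could then watch it with tail -f "$(latest_job_out)" while the job runs.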


cluster/30.txt · Last modified: 2007/08/31 10:24 (external edit)