cluster:30

cluster:30 [2007/08/31 10:24]
cluster:30 [2007/08/31 10:24] (current)
Line 1: Line 1:
 +\\
 +**[[cluster:​28|Back]]**
  
 +=> Platform/​OCS'​s **very good** {{:​cluster:​lava_using_6.1.pdf|Running Jobs with Platform Lava}} (read it). 
 +
 +=> For every command in the examples below, ''man //command//'' gives you detailed information, for example ''man bsub''.
 +
 +
 +
 +
 +
 +
 +===== Jobs =====
 +
 +This page covers non-Infiniband jobs only! For Infiniband submissions go to [[cluster:32|Internal Link]]
 +
 +This write-up only covers submitting jobs with scripts, that is, in batch mode.  There is an interactive mode too, but if you create a script you have a record of how you submitted your job.
 +
 +So I'm creating two bash shell scripts (they must be bash scripts!).  The first, **myscript**, sets up the environment and resources needed; the second, **myjob**, contains the actual program I want to run and any shell actions needed.
 +
 +**myscript**
 +<​code>​
 +#!/bin/bash
 +
 +# queue
 +#BSUB -q idle
 +
 +# email me (##SUB) or save in $HOME (#SUB)
 +##BSUB -o outfile.email ​  # standard out
 +#BSUB  -e outfile.err     # standard error
 +
 +# unique job scratch dirs
 +MYSANSCRATCH=/​sanscratch/​$LSB_JOBID
 +MYLOCALSCRATCH=/​localscratch/​$LSB_JOBID
 +export MYSANSCRATCH MYLOCALSCRATCH
 +
 +# run my job
 +./myjob one-arg two-arg
 +
 +# label my job
 +#BSUB -J myLittleJob
 +</​code>​
 +
 +The convention '#BSUB -parameter value' passes that parameter to ''bsub'' as if it had been given on the command line ... ''man bsub'' for more information.  To disable such a line, add another pound sign like '##BSUB ...' and it will be treated as an ordinary comment. So in the example above the standard output will be sent to me via email (the default behavior), but standard error output (which could be rather large) is written to the file outfile.err in the directory the job runs in (my home directory here) when the job finishes.
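To double-check which directives are live before submitting, a small text filter can help. This is only a sketch: ''active_bsub_options'' is a hypothetical helper name, and it simply lists lines that start with a single '#BSUB'. It does not call ''bsub'' or reproduce how ''bsub'' itself parses the script.

```shell
#!/bin/bash
# Hypothetical helper (not part of Lava/LSF): print the options of every
# active "#BSUB" line in a script. Lines starting with "##BSUB" are skipped
# because their second character is "#", not "B".
active_bsub_options() {
    grep -E '^#BSUB[[:space:]]' "$1" |
        sed -E 's/^#BSUB[[:space:]]+//; s/[[:space:]]+#.*$//'
}
```

Running ''active_bsub_options myscript'' prints one option per line with trailing comments removed, so a directive that was commented out by accident is easy to spot.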
 +
 +Other than that, ENV variables are made available to **myjob**, a queue is defined (actually unnecessary as idle is the default queue) and two command line arguments are passed to **myjob**. ​ Finally, a cute label is assigned.
 +
 +**myjob**
 +<​code>​
 +#!/bin/bash
 +
 +# pre_exec routine will create scratch dirs
 +# $MYSANSCRATCH ​  in /sanscratch
 +# $MYLOCALSCRATCH in /​localscratch
 +
 +# in home directory
 +for i in `seq 1 25`; 
 +do 
 +d=`date`; ​
 +echo "$i $HOSTNAME $2 $1 $d" >> $MYLOCALSCRATCH/​outfile; ​
 +done
 +
 +# retrieve some results
 +tail $MYLOCALSCRATCH/​outfile > $MYSANSCRATCH/​outfile2
 +cp $MYSANSCRATCH/​outfile2 ./​outfile3.$LSB_JOBID
 +
 +echo DONE ... these dirs will be removed via post_exec
 +echo $MYSANSCRATCH $MYLOCALSCRATCH
 +</​code>​
 +
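 +OK, so my program grabs the date and appends it, along with the loop counter, the host name and the two command line arguments (in reverse order), to a file in the MYLOCALSCRATCH directory, 25 times over.  Then ''tail'' grabs the last 10 lines of that file and writes them to a file in the MYSANSCRATCH directory. Just for fun.  Finally we copy that file to our home directory for keeps, with the job ID appended to its name.  Then we echo 'DONE' and the two scratch paths to standard out. Marvelous.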
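The same steps can be written a little more defensively. The following is a sketch, not the script used on the cluster: ''run_myjob_steps'' is a hypothetical function name, and the fallback suffix ''nojobid'' (used when LSB_JOBID is unset, for example when trying the steps outside the scheduler) is an assumption for illustration. It stops early if the scratch variables were not exported and quotes every path.

```shell
#!/bin/bash
# Sketch of the myjob steps with checks and quoting (hypothetical helper).
run_myjob_steps() {
    # Stop early if the caller did not export the scratch variables.
    : "${MYSANSCRATCH:?not set}" "${MYLOCALSCRATCH:?not set}"

    # Append 25 lines: counter, host name, both arguments (reversed), date.
    for i in $(seq 1 25); do
        echo "$i $HOSTNAME $2 $1 $(date)" >> "$MYLOCALSCRATCH/outfile"
    done

    # Keep the last 10 lines and copy them to the current directory.
    tail "$MYLOCALSCRATCH/outfile" > "$MYSANSCRATCH/outfile2"
    # "nojobid" is an assumed fallback for runs outside the scheduler.
    cp "$MYSANSCRATCH/outfile2" "./outfile3.${LSB_JOBID:-nojobid}"
}
```

Called as ''run_myjob_steps one-arg two-arg'' from a script whose scratch directories already exist, it leaves the same three files behind as **myjob** does.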
 +
 +
 +
 +
 +===== bsub and bjobs =====
 +
 +Straightforward.
 +
 +<​code>​
 +[hmeij@swallowtail ~]$ bsub < myscript
 +Job <​1001>​ is submitted to queue <​idle>​.
 +</​code>​
 +
 +<​code>​
 +[hmeij@swallowtail ~]$ bjobs
 +JOBID   ​USER ​   STAT  QUEUE      FROM_HOST ​  ​EXEC_HOST ​  ​JOB_NAME ​  ​SUBMIT_TIME
 +1001    hmeij   ​PEND ​ idle       ​swallowtail ​   -        myLittleJob Apr 18 11:28
 +</​code>​
 +
 +<​code>​
 +[hmeij@swallowtail ~]$ bjobs
 +JOBID   ​USER ​   STAT  QUEUE      FROM_HOST ​  ​EXEC_HOST ​  ​JOB_NAME ​  ​SUBMIT_TIME
 +1001    hmeij   ​RUN ​  ​idle ​      ​swallowtail compute-1-14 myLittleJob Apr 18 11:28
 +</​code>​
 +
 +<​code>​
 +[hmeij@swallowtail ~]$ bjobs
 +No unfinished job found
 +</​code>​
 +
 +<hi #​ffff00>''​bjobs''​ can also explain why your job is in PEND status ...</​hi>​
 +
 +<​code>​
 +[hmeij@swallowtail gaussian]$ bjobs -p 13892
 +JOBID   ​USER ​   STAT  QUEUE      FROM_HOST ​  ​EXEC_HOST ​  ​JOB_NAME ​  ​SUBMIT_TIME
 +13892   ​hmeij ​  ​PEND ​ gaussian ​  ​swallowtail ​   -        run101 ​    Aug 29 16:23
 + ​Queue'​s per-host job slot limit reached: 1 host;
 +</​code>​
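If you want just the state of one job, say inside another script, the STAT column can be picked out of the listing. A minimal sketch, assuming the column layout shown in the ''bjobs'' listings above (JOBID first, STAT third); ''job_stat'' is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: read bjobs-style output on stdin and print the STAT
# column of the row whose JOBID matches the first argument.
# Assumes the default layout shown above: JOBID USER STAT QUEUE ...
job_stat() {
    awk -v id="$1" 'NR > 1 && $1 == id { print $3 }'
}
```

For example ''bjobs | job_stat 1001''. It prints nothing when the job is not listed, such as when ''bjobs'' reports that no unfinished job was found.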
 +
 +===== bhist =====
 +
 +You can query the scheduler for the history of your job, from submission to completion, even after the job has finished and ''bjobs'' no longer lists it.
 +
 +<​code>​
 +[hmeij@swallowtail ~]$ bhist -l 1001
 +
 +Job <​1001>,​ Job Name <​myLittleJob>,​ User <​hmeij>,​ Project <​default>,​ Command <#
 +                     ​!/​bin/​bash;​ # queue;#BSUB -q idle; # email me (##SUB) or s
 +                     ave in $HOME (#​SUB);##​BSUB -o outfile.email # standard oup
 +                     ​ut;#​BSUB ​ -e outfile.err ​  # standard error; # unique job
 +                     ​scratch dirs;​MYSANSCRATCH=/​sanscratch/​$LSB_JOBID;​MYLOCALS>​
 +
 +Wed Apr 18 11:28:14: Submitted from host <​swallowtail>,​ to Queue <​idle>,​ CWD <$
 +                     ​HOME>,​ Error File <​outfile.err>;​
 +Wed Apr 18 11:28:20: Dispatched to <​compute-1-14>;​
 +Wed Apr 18 11:28:20: Starting (Pid 21569);
 +Wed Apr 18 11:28:25: Running with execution home </​home/​hmeij>,​ Execution CWD <
 +                     /​home/​hmeij>,​ Execution Pid <​21569>;​
 +Wed Apr 18 11:28:25: Done successfully. The CPU time used is 0.0 seconds;
 +Wed Apr 18 11:28:35: Post job process done successfully;​
 +
 +Summary of time in seconds spent in various states by  Wed Apr 18 11:28:35
 +  PEND     ​PSUSP ​   RUN      USUSP    SSUSP    UNKWN    TOTAL
 +  6        0        5        0        0        0        11
 +</​code>​
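The time summary at the bottom is handy for seeing how long a job waited versus how long it ran. A sketch that pulls those numbers out, assuming the two-line summary layout shown above (a header row starting with PEND and ending with TOTAL, followed by one row of values); ''bhist_times'' is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: read "bhist -l" style output on stdin and print the
# pending, running and total seconds from the summary table.
# Assumes the header row "PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL" is
# immediately followed by the row of values, as in the output above.
bhist_times() {
    awk '$1 == "PEND" && $NF == "TOTAL" {
             getline                      # move on to the row of values
             print "pend=" $1, "run=" $3, "total=" $NF
         }'
}
```

For example ''bhist -l 1001 | bhist_times''.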
 +
 +
 +===== Job Output =====
 +
 +The above job submission yields ...
 +
 +<​code>​
 +[hmeij@swallowtail hmeij]# ls -l
 +...
 +-rw-r--r-- ​ 1 hmeij its  670 Apr 18 11:28 outfile3.1001
 +-rw-r--r-- ​ 1 hmeij its    0 Apr 18 11:18 outfile.err
 +...
 +</​code>​
 +
 +and the following email
 +
 +<​code>​
 +Job <​myLittleJob>​ was submitted from host <​swallowtail>​ by user <​hmeij>​.
 +Job was executed on host(s) <​compute-1-14.local>,​ in queue <​idle>,​ as user <​hmeij>​.
 +</​home/​hmeij>​ was used as the home directory.
 +</​home/​hmeij>​ was used as the working directory.
 +Started at Wed Apr 18 11:28:20 2007
 +Results reported at Wed Apr 18 11:28:25 2007
 +
 +Your job looked like:
 +
 +------------------------------------------------------------
 +# LSBATCH: User input
 +#!/bin/bash
 +
 +# queue
 +#BSUB -q idle
 +
 +# email me (##SUB) or save in $HOME (#SUB)
 +##BSUB -o outfile.email # standard ouput
 +#BSUB  -e outfile.err ​  # standard error
 +
 +# unique job scratch dirs
 +MYSANSCRATCH=/​sanscratch/​$LSB_JOBID
 +MYLOCALSCRATCH=/​localscratch/​$LSB_JOBID
 +export MYSANSCRATCH MYLOCALSCRATCH
 +
 +# run my job
 +./myjob one-arg two-arg
 +
 +# label my job
 +#BSUB -J myLittleJob
 +
 +
 +------------------------------------------------------------
 +
 +Successfully completed.
 +
 +Resource usage summary:
 +
 +    CPU time   : ​     0.05 sec.
 +    Max Memory :         4 MB
 +    Max Swap   : ​      117 MB
 +
 +    Max Processes ​ :         3
 +    Max Threads ​   :         3
 +
 +The output (if any) follows:
 +
 +DONE ... these dirs will be removed via post_exec
 +/​sanscratch/​1001 /​localscratch/​1001
 +
 +
 +PS:
 +
 +Read file <​outfile.err>​ for stderr output of this job.
 +</​code>​
 +
 +
 +
 +
 +
 +
 +===== b'ees =====
 +
 +Other ''​b//​__name__//''​ utilities for managing your jobs ...
 +
 +**''​bkill''​** JOBID ... stops your job
 +
 +**''​bstop''​** JOBID ... suspends your job
 +
 +**''​bresume''​** JOBID ... resumes your job
 +
 +**''​brequeue''​** JOBID ... stops your job and requeues it
 +
 +**''brun''** -m HOSTNAME JOBID ... forces your job to run on the given host (administrators only)
 +
 +**''bswitch''** ALTERNATE_QUEUE JOBID ... moves your pending or running job to another queue
 +
 +**''​bpeek''​** JOBID ... peek at your job output while it is running
 +
 +=> ''bpeek'' shows you the ''tail'' of the job's standard output and standard error.  Alternatively, you can follow the progress of your jobs in the directory ~/.lsbatch.  For each job there will be a timestamp.jobpid.err, timestamp.jobpid.out and timestamp.jobpid.shell file.  Do not remove or edit these files while your job is running.
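To follow the newest of those files without typing its name, something like the following can be used. It is a sketch: ''latest_lsbatch_out'' is a hypothetical helper name, and it relies only on the .out suffix and the ~/.lsbatch location described above. It reads the files and never modifies them.

```shell
#!/bin/bash
# Hypothetical helper: print the most recently modified *.out file in the
# given directory (default: ~/.lsbatch). Prints nothing if there is none.
latest_lsbatch_out() {
    dir="${1:-$HOME/.lsbatch}"
    ls -t "$dir"/*.out 2>/dev/null | head -n 1
}
```

For example ''tail -f "$(latest_lsbatch_out)"'' follows the standard output of the job that wrote most recently.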
 +
 +
 +\\
 +**[[cluster:​28|Back]]**
cluster/30.txt · Last modified: 2007/08/31 10:24 (external edit)