\\
**[[cluster:0|Back]]**


====== Complete Documentation ======

The complete documentation is at this link: **[[https://dokuwiki.wesleyan.edu/doku.php?id=cluster:58|COMPLETE DOCUMENTATION FOR LSF/HPC 6.2]]**. It is very good.



====== New Features in LSF 6.2 ======

This page will be expanded to show examples of LSF/HPC advanced features.

The more information you can provide to the scheduler regarding run times and the resources needed (and when), the more efficient the scheduling will be.  The examples below are made-up scenarios.  Try to get familiar with them, or ask for hands-on working sessions.


=> Also read up on the new queue configurations: **[[cluster:29|Link]]**

As part of the upgrade:

  * Jobs were terminated; for a list of which ones, view [[http://swallowtail.wesleyan.edu/clumon/jobs-killed.php|External Link]]

  * The working directories of those terminated jobs were saved in **/sanscratch/OLDJOBS**; help yourself ...

  * When the new scheduler came online it started with JOBPID 101. That might have clobbered some of your old output files, so I've spooled the JOBPIDs forward to 30,000.

  * Some home directories have been relocated, but /home///username// remains the same, FYI.

  * **Parallel job submission syntax has changed (and will change further)!** However, the "old way" still works.  See below; I should have the documentation updated shortly.  This will primarily affect the Amber users (who like to use multiple hosts), but not the Gaussian users (who like to use a single host).

  * We're still experiencing license issues ... more later.


===== Exclusive =====

If you wish to use a compute node in "exclusive" mode, use the **''bsub -x ...''** syntax.  You may wish to do this, for example, if you want all of the memory available to your job, or if you want all of the cores.  Note that in either case resources are "wasted": if you allocate all the memory, cores may go idle; if you request all the cores, memory may go unused.  Try to match your needs with the host resources.

Here is how it works. In your program ...

<code>
#BSUB -q elw
#BSUB -x
#BSUB -J "myLittleJob"
</code>

Once your job runs ...

<code>
[hmeij@swallowtail ~]$ bhosts
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
compute-1-18       closed          -      8      1      1      0      0      0
</code>

You will notice that the host status is now "closed" even though it runs only 1 job.

<code>
[hmeij@swallowtail ~]$ bhosts -l compute-1-18
HOST  compute-1-18
STATUS           CPUF  JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV DISPATCH_WINDOW
closed_Excl    240.00     -      8      1      1      0      0      0      -
</code>

Please note that the ''matlab'' queue does not support this.  There are, of course, other ways to obtain exclusive use:

  * For serial jobs, ''bsub -n 8 ...'' requests all the cores on a single host.
  * ''bsub -R ...'', see below.
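For example, a serial job can claim all eight job slots on one host so that nothing else is scheduled alongside it. A sketch (the queue name follows the examples on this page; the job name and program are illustrative):

<code>
#BSUB -q elw
#BSUB -n 8
#BSUB -J "allSlots"
# a serial program: it uses 1 core, the other 7 slots simply stay reserved
./myprogram
</code>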



===== Resource Reservation =====

**''bsub -R "resource_string" ...''**

This is a very powerful argument you can give to ''bsub''. For a detailed description, read **[[http://lsfdocs.wesleyan.edu/lsf6.2_using/B_jobops.html#230606|External Link]]**.

Here is a simple example: a simple script in which we ask for 200 MB of memory.

<code>
...
# queue
#BSUB -q elw
#BSUB -R "rusage[mem=200]"
#BSUB -J "myLittleJob"
...
</code>

Submit the job and observe the resource reservation (note the value under "mem" in the "Reserved" line).  While this job is running, any new job submitted to this host can only ask for a maximum of 3660M - 200M = 3460M.  The scheduler handles all this for you.

There are many, many options for resource reservation.  You can introduce time-based decay or accumulation behavior for resources.  Read the **External Link** material above.
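For instance, the ''duration'' and ''decay'' keywords let a reservation be released over time rather than held for the whole run. A sketch, assuming your job's memory use drops off after a start-up phase (the values are illustrative; see the External Link material for the exact semantics):

<code>
# reserve 200 MB, released linearly over the first 30 minutes
#BSUB -R "rusage[mem=200:duration=30:decay=1]"
</code>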

<code>
[hmeij@swallowtail ~]$ bsub < ./myscript
Job <30238> is submitted to queue <elw>.

[hmeij@swallowtail ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
30238   hmeij   RUN   elw        swallowtail compute-1-21 myLittleJob Nov 20 10:10

[hmeij@swallowtail ~]$ bhosts -l compute-1-21
HOST  compute-1-21
STATUS           CPUF  JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV DISPATCH_WINDOW
ok             240.00     -      8      1      1      0      0      0      -

 CURRENT LOAD USED FOR SCHEDULING:
              r15s   r1m  r15m    ut    pg    io   ls    it   tmp   swp   mem gm_ports
 Total         0.0   0.0   0.0    0%   1.7   127    0  1169 7116M 4000M 3660M      0.0
 Reserved      0.0   0.0   0.0    0%   0.0     0    0     0    0M    0M  200M       -
...
</code>

There are two custom resources that have been defined outside of LSF: ''localscratch'' and ''sanscratch''. Their values represent the amount of free disk space.  With the ''rusage'' option you can similarly reserve disk space for your job and avoid conflicts with other jobs.

Remember that /localscratch is local to the individual compute nodes and is roughly 70 GB on all nodes except the heavy weight nodes with the attached MD1000 disk arrays.  The latter nodes, nfs-2-1 ... nfs-2-4 (the ''ehwfd'' queue), have roughly 230 GB of /localscratch available.  The /sanscratch file system is shared by all nodes.
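Reserving scratch space then looks just like reserving memory. A sketch, assuming the custom resources are tracked in MB like ''mem'' (the queue and job name are illustrative):

<code>
# reserve ~10 GB of the node-local scratch disk
#BSUB -q elw
#BSUB -R "rusage[localscratch=10000]"
#BSUB -J "scratchJob"
</code>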




===== Wall Clock Time =====

Not a //new// feature, but one I strongly encourage you to use.  \\
The queue policy BACKFILL //is a new option//, defined at the queue level.

With wall clock time information available for each job, the scheduler is able to exercise the BACKFILL policy.  That is, if job A still has, for example, 6 hours to run and a job slot is available on that host, the scheduler will assign higher priorities to other jobs that can run on that host within 6 hours.  The key here is that those unused job slots may be reserved for job B, scheduled to run once job A finishes.

To specify ...

<code>
#BSUB -W hours:minutes
</code>

For efficient backfilling, the queues should have a default RUNLIMIT defined; however, we do not apply one.  Thus backfilling can only happen when users specify the **-W** option during job submission.  Jobs that exceed their specified limit are terminated automatically.
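For example, a job expected to finish within six hours would declare (queue and job name are illustrative):

<code>
#BSUB -q elw
#BSUB -W 6:00
#BSUB -J "sixHourJob"
</code>

Since jobs that run past their limit are terminated, pad your estimate a bit rather than cutting it close.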


===== Parallel Jobs =====


==== Old Way ====

Good news!  It appears the "old way" of submitting jobs still works, that is, using the "mpirun" wrapper scripts.  This method is not recommended because, once the job is submitted, LSF has no knowledge of the parallel tasks.  But it still works, so in a pinch use your old scripts.



==== Spanning ====

A very handy feature.  You may have to experiment with its impact on your jobs.  Basically, if we ask for 16 job slots, we can dictate to the scheduler how many we want per node.  Previously, the scheduler would fill up one host, then move on to the next host, etc.

But consider this: 16 job slots (cores) are requested and we want no more than 2 allocated per host. The resource request ''span'' instructs the scheduler to tile the parallel tasks across multiple hosts.  So submit and observe the allocation.

<code>
#!/bin/bash
#BSUB -q imw
#BSUB -n 16
#BSUB -J test
#BSUB -o out
#BSUB -e err
#BSUB -R "span[ptile=2]"
...
</code>

<code>

[hmeij@swallowtail cd]$ bjobs

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
30244   hmeij   RUN   imw        swallowtail 2*compute-1-13:2*compute-1-14:
2*compute-1-8:2*compute-1-10:2*compute-1-4:2*compute-1-9:2*compute-1-16:
2*compute-1-7 test       Nov 20 11:04

</code>

This also works with the "old way" of submitting ;-)\\
Some jobs will benefit from this tremendously; others may not.
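The same mechanism handles other layouts. Two illustrative sketches: ''ptile=1'' spreads the tasks one per host (so each task sees a whole node's memory), while ''hosts=1'' forces all tasks onto a single host:

<code>
# one task per host
#BSUB -R "span[ptile=1]"

# all tasks on a single host
#BSUB -R "span[hosts=1]"
</code>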


==== New Way ====

Let's start a **[[cluster:64|new page]]**.

 --- //[[hmeij@wesleyan.edu|Meij, Henk]] 2008/01/09 11:31//

\\
**[[cluster:0|Back]]**
  
cluster/59.txt · Last modified: 2008/01/09 14:04 (external edit)