\\
**[[cluster:28|Back]]**

=> Lava, the scheduler, does not natively support parallel job submission, so a wrapper script is necessary.  The wrapper obtains the allocated hosts from the LSB_HOSTS variable and builds the "machines" file.  Follow the **TEST** link below for detailed information.
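
A minimal sketch of what such a wrapper does is shown below.  This is illustrative only, not the actual ''/share/apps/bin/mpich-mpirun'' script; the mpirun path and the temporary file location are assumptions.

<code>
#!/bin/bash
# illustrative wrapper sketch: turn the space-separated LSB_HOSTS list
# into a machines file, then hand everything else to mpirun
MACHINES=/tmp/machines.$LSB_JOBID
for host in $LSB_HOSTS; do
    echo $host
done > $MACHINES

# pass the caller's arguments (e.g. -np 16 and the program) straight through
exec /share/apps/openmpi-1.2/bin/mpirun -machinefile $MACHINES "$@"
</code>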

=> There is a splendid course about MPI offered by NCSA at UIUC.  If you're serious about MPI, take it; you can find a link to the course **[[cluster:28|here]]**.

=> In all the examples below, ''man //command//'' provides detailed information, for example ''man bsub''.




===== Jobs =====

Infiniband!  For non-Infiniband jobs go to [[cluster:30|Internal Link]].

PLEASE READ THE 'ENV TEST' SECTION; IT EXPLAINS WHY THIS IS COMPLICATED.  \\
You also need to test that your environment is set up correctly => **[[cluster:31|ENV TEST]]** <=

This write-up focuses only on submitting jobs with scripts, that is, in batch mode.  A single bash shell script (it must be a bash shell!) is submitted to the scheduler; below it is called ''imyscript''.

**imyscript**
<code>
#!/bin/bash

# queue
#BSUB -q idebug -n 16

# email me the output (##BSUB) or save it in $HOME (#BSUB)
##BSUB -o outfile.email # standard output
#BSUB  -e outfile.err   # standard error

# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH

# run my job
/share/apps/bin/mpich-mpirun -np 16 /share/apps/openmpi-1.2/bin/cpi

echo DONE ... these dirs will be removed via post_exec
echo $MYSANSCRATCH $MYLOCALSCRATCH

# label my job
#BSUB -J myLittleiJob
</code>

This looks much like the non-Infiniband job submissions, but there are some key changes.  First we specify a queue whose nodes are connected to the Infiniband switch (idebug), and we request 16 processors.  Queue idebug is comprised of the nodes ilogin/ilogin2, each with dual quad-core CPUs, so 2x2x4=16 cores, and we will be using all of them.
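
If you want to confirm those numbers yourself, the standard scheduler query commands show the per-queue and per-host job slot limits.  A quick sketch (output omitted; the host names are taken from the run shown further down):

<code>
# show the idebug queue and its job slot limits
bqueues idebug

# show the maximum job slots on the execution hosts
bhosts compute-1-15 compute-1-16
</code>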

The most significant change is that we call a 'wrapper' script.  This script, ''mpich-mpirun'', wraps the program, surprise, ''mpirun''.  The reason is that the wrapper builds the ''machines'' file on the fly.

If you want to use the [[http://www.lam-mpi.org/|Local Area Multicomputer (LAM)]] MPI libraries, use the following wrapper script instead: ''/share/apps/bin/mpich-mpirun.gnulam''
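
For example, the run line in ''imyscript'' would change to something like the sketch below.  This assumes your program was built against the LAM libraries; the binary path here is just a placeholder.

<code>
# run my job with the LAM MPI wrapper instead
/share/apps/bin/mpich-mpirun.gnulam -np 16 /path/to/my_lam_program
</code>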

"make today an [[http://www.open-mpi.org/|OpenMPI]] day"

===== bsub and bjobs =====

Straightforward.

<code>
[hmeij@swallowtail ~]$ bsub < imyscript
Job <1011> is submitted to queue <idebug>.
</code>

<code>
[hmeij@swallowtail ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1011    hmeij   PEND  idebug     swallowtail    -        myLittleiJob Apr 19 14:54
</code>

<code>
[hmeij@swallowtail ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1011    hmeij   RUN   idebug     swallowtail\
  compute-1-16:compute-1-16:compute-1-16:compute-1-16:\
  compute-1-16:compute-1-16:compute-1-16:compute-1-16:\
  compute-1-15:compute-1-15:compute-1-15:compute-1-15:\
  compute-1-15:compute-1-15:compute-1-15:compute-1-15 myLittleiJob Apr 19 14:54
</code>

<code>
[hmeij@swallowtail ~]$ bjobs
No unfinished job found
</code>

Note: as expected, 8 cores (EXEC_HOST) were invoked on each node.


===== bhist =====

You can query the scheduler regarding the status of your job.

<code>
[hmeij@swallowtail ~]$ bhist -l 1011

Job <1011>, Job Name <myLittleiJob>, User <hmeij>, Project <default>, Command <
                     #!/bin/bash; # queue;#BSUB -q idebug -n 16; # email me (##
                     SUB) or save in $HOME (#SUB);##BSUB -o outfile.email # sta
                     ndard ouput;#BSUB  -e outfile.err   # standard error; # un
                     ique job scratch dirs;MYSANSCRATCH=/sanscratch/$LSB_JOBID>

Thu Apr 19 14:54:19: Submitted from host <swallowtail>, to Queue <idebug>, CWD
                     <$HOME>, Error File <outfile.err>, 16 Processors Requested
                     ;
Thu Apr 19 14:54:24: Dispatched to 16 Hosts/Processors <compute-1-16> <compute-
                     1-16> <compute-1-16> <compute-1-16> <compute-1-16> <comput
                     e-1-16> <compute-1-16> <compute-1-16> <compute-1-15> <comp
                     ute-1-15> <compute-1-15> <compute-1-15> <compute-1-15> <co
                     mpute-1-15> <compute-1-15> <compute-1-15>;
Thu Apr 19 14:54:24: Starting (Pid 6266);
Thu Apr 19 14:54:31: Running with execution home </home/hmeij>, Execution CWD <
                     /home/hmeij>, Execution Pid <6266>;
Thu Apr 19 14:55:47: Done successfully. The CPU time used is 0.0 seconds;
Thu Apr 19 14:55:57: Post job process done successfully;

Summary of time in seconds spent in various states by  Thu Apr 19 14:55:57
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  5        0        83       0        0        0        88
</code>



===== Job Output =====

The above job submission yields ...

<code>
[hmeij@swallowtail ~]$ cat outfile.err
Process 11 on compute-1-16.local
Process 6 on compute-1-15.local
Process 14 on compute-1-16.local
Process 0 on compute-1-15.local
Process 1 on compute-1-15.local
Process 2 on compute-1-15.local
Process 3 on compute-1-15.local
Process 8 on compute-1-16.local
Process 4 on compute-1-15.local
Process 9 on compute-1-16.local
Process 5 on compute-1-15.local
Process 10 on compute-1-16.local
Process 7 on compute-1-15.local
Process 12 on compute-1-16.local
Process 13 on compute-1-16.local
Process 15 on compute-1-16.local
</code>

and the following email:

<code>
Job <myLittleiJob> was submitted from host <swallowtail> by user <hmeij>.
Job was executed on host(s) <8*compute-1-16>, in queue <idebug>, as user <hmeij>.
                            <8*compute-1-15>
</home/hmeij> was used as the home directory.
</home/hmeij> was used as the working directory.
Started at Thu Apr 19 14:54:24 2007
Results reported at Thu Apr 19 14:55:47 2007

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
#!/bin/bash

# queue
#BSUB -q idebug -n 16

# email me (##SUB) or save in $HOME (#SUB)
##BSUB -o outfile.email # standard ouput
#BSUB  -e outfile.err   # standard error

# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH

# run my job
/share/apps/bin/mpich-mpirun -np 16 /share/apps/openmpi-1.2/bin/cpi

# label my job
#BSUB -J myLittleiJob


------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :      0.05 sec.
    Max Memory :         7 MB
    Max Swap   :       205 MB

    Max Processes  :         5
    Max Threads    :         5

The output (if any) follows:

pi is approximately 3.1416009869231245, Error is 0.0000083333333314
wall clock time = 0.312946
DONE ... these dirs will be removed via post_exec
/sanscratch/1011 /localscratch/1011

PS:

Read file <outfile.err> for stderr output of this job.
</code>




===== Bingo =====

When I ran these OpenMPI invocations I was also running an HPLinpack benchmark on the Infiniband nodes (to assess whether the nodes would still respond).  **[[cluster:26|Follow this link to read about the HPLinpack runs.]]**

The idebug queue overrides the job slots set for each node (Max Job Slots = # of cores => 8).  It allows QJOB_LIMIT=16 and UJOB_LIMIT=16.  The benchmark was already running 8 jobs per node, and our job asked for 8 more per host.  So basically, the hosts' job slots were exhausted, as well as our user limit.
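
For reference, the queue definition that allows this oversubscription would look roughly like the ''lsb.queues'' stanza sketched below.  This is not a copy of the actual configuration, just an illustration of where QJOB_LIMIT and UJOB_LIMIT live.

<code>
Begin Queue
QUEUE_NAME   = idebug
QJOB_LIMIT   = 16    # max job slots for the whole queue
UJOB_LIMIT   = 16    # max job slots per user in this queue
End Queue
</code>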

{{:cluster:cpi.gif|Cute}}

And so it was.\\




===== The Problem =====

(important: I repeat this from another page --- //[[hmeij@wesleyan.edu|Henk Meij]] 2007/04/19 15:52//)

Once you have compiled your binary, you can execute it on the head node or any other node by specifying a hardcoded ''machines'' file, like so (look at the code of this script):

<code>
[hmeij@swallowtail ~]$ /share/apps/bin/cpi.run
</code>
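
For reference, a hardcoded run of that kind typically looks something like the sketch below.  The hostnames and the machines file location are made up, and ''cpi.run'' already wraps the equivalent commands.

<code>
# hypothetical machines file listing the hosts to use
cat > ~/machines <<EOF
compute-1-15
compute-1-16
EOF

# launch the binary across those hosts with the hardcoded machines file
/share/apps/openmpi-1.2/bin/mpirun -np 16 -machinefile ~/machines \
    /share/apps/openmpi-1.2/bin/cpi
</code>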

This will not work when submitting your program to ''bsub''.  Platform reports:

<hi yellow>
Lava (your scheduler) is not natively capable of running parallel jobs, so you will have to write your own integration script to parse the hosts allocated by LSF (via the LSB_HOSTS variable) and pass them to your MPI distribution.
</hi>

<hi orange>
Also, because of the lack of LSF's parallel support daemons, these scripts can only provide a loose integration with Lava. Specifically, Lava only knows about the mpirun process on the first host; it has no knowledge of the other parallel processes on the other hosts involved in a parallel job. So if, in some circumstances, a parallel job fails, Lava cannot clean up the leftover processes, for example, MPICH 1's shared-memory leftovers. You may want to regularly check your cluster for this issue.
</hi>


And this makes the job submission process for parallel jobs tedious.
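
Given that caveat, it is worth sweeping the compute nodes for orphaned MPI processes after a failed parallel job.  A minimal, hypothetical check is sketched below; the node names and the process name pattern are only examples.

<code>
#!/bin/bash
# look for leftover MPI-related processes on a couple of nodes
for node in compute-1-15 compute-1-16; do
    echo "== $node =="
    ssh $node "ps -ef | grep -E 'mpirun|cpi' | grep -v grep"
done
</code>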


\\
**[[cluster:28|Back]]**