\\
**[[cluster:

=> Lava, the scheduler, is not natively capable of handling parallel job submissions.

=> There is a splendid course offered by NCSA at UIUC about MPI. If you're serious about MPI, take it; you can find a link to access this course **[[cluster:

=> In all the examples below, ''


===== Jobs =====

Infiniband!

PLEASE READ THE 'ENV TEST' SECTION, IT'LL EXPLAIN WHY IT IS COMPLICATED.
Also, you need to test that your environment is set up correctly.
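
A minimal, hedged way to do such a check might be to submit a throwaway script that only reports what the job environment actually looks like. The file names below are made up for illustration:

<code>
#!/bin/bash
# hypothetical environment check -- adjust queue and slot count to your needs
#BSUB -q idebug -n 16
#BSUB -o envtest.out
#BSUB -e envtest.err

# report the hosts the scheduler allocated and the full environment
echo $LSB_HOSTS
env
</code>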

This write-up focuses only on how to submit jobs using scripts, that is, in batch mode. A single bash shell script (it must be a bash shell!), here called ''imyscript'', is submitted to the scheduler.

**imyscript**
<code>
#!/bin/bash

# queue
#BSUB -q idebug -n 16

# email me (##BSUB) or save in $HOME (#BSUB)
##BSUB -o outfile.email # standard output
#BSUB -e outfile.err

# unique job scratch dirs
MYSANSCRATCH=/
MYLOCALSCRATCH=/
export MYSANSCRATCH MYLOCALSCRATCH

# run my job
/

echo DONE ... these dirs will be removed via post_exec
echo $MYSANSCRATCH $MYLOCALSCRATCH

# label my job
#BSUB -J myLittleiJob
</code>

This looks much like the non-Infiniband job submissions, but there are some key changes.

The most significant change is that we will be calling a ''
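
As a rough illustration only (not the actual wrapper used on our cluster), an OpenMPI launch over the Infiniband interconnect might look something like the following; the binary, machinefile location, and process count are hypothetical:

<code>
# hypothetical example -- select the Infiniband (openib) byte transfer layer
# rather than plain tcp, and point mpirun at a machinefile of allocated hosts
mpirun --mca btl openib,self \
       -np 16 \
       -machinefile $HOME/machinefile \
       $HOME/cpi
</code>

How that machinefile gets built from what the scheduler hands the job is exactly the issue discussed in 'The Problem' section below.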

If you want to use the [[http://

"make today an [[http://

===== bsub and bjobs =====

Straightforward.

<code>
[hmeij@swallowtail ~]$ bsub < imyscript
Job <
</code>

<code>
[hmeij@swallowtail ~]$ bjobs
JOBID
1011 hmeij
</code>

<code>
[hmeij@swallowtail ~]$ bjobs
JOBID
1011 hmeij
compute-1-16:
compute-1-16:
compute-1-15:
compute-1-15:
</code>

<code>
[hmeij@swallowtail ~]$ bjobs
No unfinished job found
</code>

Note: as expected, 8 cores (EXEC_HOST) were invoked on each node; with ''-n 16'' and 8 cores per node, the scheduler allocated 16 / 8 = 2 nodes (compute-1-15 and compute-1-16).


===== bhist =====

You can query the scheduler regarding the status of your job.

<code>
[hmeij@swallowtail ~]$ bhist -l 1011

Job <
#
SUB) or save in $HOME (#
ndard ouput;#
ique job scratch dirs;

Thu Apr 19 14:54:19: Submitted from host <
                     <
                     ;
Thu Apr 19 14:54:24: Dispatched to 16 Hosts/
Thu Apr 19 14:54:24: Starting (Pid 6266);
Thu Apr 19 14:54:31: Running with execution home </
                     /
Thu Apr 19 14:55:47: Done successfully. The CPU time used is 0.0 seconds;
Thu Apr 19 14:55:57: Post job process done successfully;

Summary of time in seconds spent in various states by Thu Apr 19 14:55:57
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  5        0        83       0        0        0        88
</code>


===== Job Output =====

The above job submission yields ...

<code>
[hmeij@swallowtail ~]$ cat outfile.err
Process 11 on compute-1-16.local
Process 6 on compute-1-15.local
Process 14 on compute-1-16.local
Process 0 on compute-1-15.local
Process 1 on compute-1-15.local
Process 2 on compute-1-15.local
Process 3 on compute-1-15.local
Process 8 on compute-1-16.local
Process 4 on compute-1-15.local
Process 9 on compute-1-16.local
Process 5 on compute-1-15.local
Process 10 on compute-1-16.local
Process 7 on compute-1-15.local
Process 12 on compute-1-16.local
Process 13 on compute-1-16.local
Process 15 on compute-1-16.local
</code>

and the following email:

<code>
Job <
Job was executed on host(s) <
<
</
</
Started at Thu Apr 19 14:54:24 2007
Results reported at Thu Apr 19 14:55:47 2007

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
#!/bin/bash

# queue
#BSUB -q idebug -n 16

# email me (##BSUB) or save in $HOME (#BSUB)
##BSUB -o outfile.email # standard output
#BSUB -e outfile.err

# unique job scratch dirs
MYSANSCRATCH=/
MYLOCALSCRATCH=/
export MYSANSCRATCH MYLOCALSCRATCH

# run my job
/

# label my job
#BSUB -J myLittleiJob


------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :
    Max Memory :      7 MB
    Max Swap   :

    Max Processes
    Max Threads

The output (if any) follows:

pi is approximately 3.1416009869231245,
wall clock time = 0.312946
DONE ... these dirs will be removed via post_exec
/

PS:

Read file <
</code>



===== Bingo =====

When I ran these OpenMPI invocations I was also running an HPLinpack benchmark on the nodes over the Infiniband interconnect (to assess whether the nodes would respond).

The idebug queue overrides the job slots set for each node (Max Job Slots = # of cores => 8). It allows for QJOB_LIMIT=16 and UJOB_LIMIT=16.
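
For reference, here is a hedged sketch of what the relevant part of such a queue definition (in lsb.queues) might look like; only the QJOB_LIMIT and UJOB_LIMIT values come from the text above, the rest is made up for illustration and may differ from the cluster's actual configuration:

<code>
Begin Queue
QUEUE_NAME   = idebug
PRIORITY     = 50
QJOB_LIMIT   = 16
UJOB_LIMIT   = 16
DESCRIPTION  = debug queue for (Infiniband) parallel test jobs
End Queue
</code>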

{{:

And so it was.\\



===== The Problem =====

(Important: I repeat this from another page --- //

Once you have your binary compiled, you can execute it on the head node or any other node with a hardcoded ''

<code>
[hmeij@swallowtail ~]$ /
</code>

This will not work when submitting your program to ''

<hi yellow>
Lava (your scheduler) is not natively capable of running parallel jobs, so you will have to write your own integration script to parse the hosts allocated by LSF (via the LSB_HOSTS variable) and feed them to your MPI distribution.
</hi>

<hi orange>
Also, because of the lack of LSF's parallel support daemons, these scripts can only provide a loose integration with Lava. Specifically,
</hi>


And this makes the job submission process for parallel jobs tedious.
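
To make that concrete, here is a minimal sketch of what such a loose integration script might look like, assuming OpenMPI: it turns the LSB_HOSTS list supplied by the scheduler into a machinefile and hands that to ''mpirun''. The file locations, binary, and options are hypothetical and not the actual wrapper used on our cluster:

<code>
#!/bin/bash
# hypothetical integration sketch -- not the cluster's actual wrapper
# LSB_HOSTS holds the hosts allocated by the scheduler, one entry per job slot

MACHINEFILE=$HOME/machinefile.$LSB_JOBID

# write one line per allocated slot
rm -f $MACHINEFILE
for host in $LSB_HOSTS; do
    echo $host >> $MACHINEFILE
done

# launch across those hosts; process count matches the allocated slots
mpirun -np $(wc -l < $MACHINEFILE) -machinefile $MACHINEFILE $HOME/cpi

rm -f $MACHINEFILE
</code>

This is the 'loose integration' the note above refers to: the scheduler dispatches the job and supplies the host list, but it has no parallel support daemons to manage the MPI processes themselves.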


\\
**[[cluster: