
<hi #ffff00> Update: in the new version of Matlab DC, the integration with LSF is complete, so always point to 'lsf' as the scheduler. Also note that you should point to the queue matlab, which spans all hosts.</hi> So:

  • In the line sched = findResource('scheduler', 'type', 'generic'); replace 'generic' with 'lsf'
  • Comment out the line set(sched, 'SubmitFcn', @lsfSimpleSubmitFcn)
  • And point to the queue: set(sched, 'SubmitArguments', '-q matlab')
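Taken together, the 2008a distributed-job setup looks roughly like this. This is a sketch that combines the changes above with the 2007a example further down the page; adjust 'DataLocation' to your own directory:

```matlab
% 2008a distributed jobs: LSF integration (sketch combining the update
% above with the 2007a example below; change DataLocation to your own dir)
sched = findResource('scheduler', 'type', 'lsf');
set(sched, 'HasSharedFilesystem', true);
set(sched, 'ClusterMatlabRoot', '/share/apps/matlab/2008a');
set(sched, 'SubmitArguments', '-q matlab');
% note: no SubmitFcn line -- the generic @lsfSimpleSubmitFcn is no longer needed
set(sched, 'DataLocation', '/home/hmeij/matlabjobs');
```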

Meij, Henk 2009/07/02 09:14

Hints on running Matlab jobs using version 2008a. This page was written up when we ran version 2007a under the scheduler Lava. We were then using generic integration scripts to connect Matlab to the scheduler. We have since upgraded to the scheduler LSF. We will still be using the generic integration for distributed jobs. However, for parallel jobs, we will be using the LSF integration scripts provided by MathWorks. So everything on this page still applies, but with these changes:

  • Distributed and Parallel Jobs

⇒ change the location where Matlab is installed

set(sched, 'ClusterMatlabRoot', '/share/apps/matlab/2008a')
  • Parallel Jobs

⇒ comment out or delete the lines referring to the generic functions

%2007a sched = findResource('scheduler', 'configuration', 'generic')
%2007a set(sched, 'Configuration', 'generic')
...
%2007a set(sched, 'ParallelSubmitFcn', @lsfParallelSubmitFcn)

⇒ point to the LSF scheduler and add the following 'bsub'-style parameters

sched = findResource('scheduler','type','LSF')
% note: set() replaces the property value, so combine all bsub-style arguments in one call
set(sched, 'SubmitArguments', '-R type==LINUX64 -q matlab')

Matlab

This Matlab installation contains the following toolboxes

  • Data Acquisition Toolbox 2.10
  • Distributed Computing Toolbox 3.1
  • Image Processing Toolbox 5.4
  • Optimization Toolbox 3.1.1
  • Filter Design Toolbox 4.1
  • Signal Processing Toolbox 6.7
  • Symbolic Math Toolbox 3.2
  • Statistics Toolbox 6.0
  • Wavelet Toolbox 4.0

Only a single user license for each, for now … the Engine can invoke up to 8 “workers” which can run simultaneously without checking out licenses.

This installation supports distributed and parallel computing across 8 “workers” using the Distributed Computing Engine. MPIexec support is built into the Engine. Please consult the web site for details.

The matlab invocation and usage on our head node swallowtail should only be used for code development and job submissions. Once your code is running without errors or your program is submitted, exit Matlab to release the license. The Engine submits your job to the cluster scheduler which in turn finds an idle host to process the job.

There are two types of jobs users can submit: distributed jobs and parallel jobs. Details regarding the steps to take for either are listed below. This information is gleaned from the MathWorks web site, which contains much more detail.

Help page for network administration regarding Distributed Computing Engine

Queue

The matlab queue only accepts 8 jobs for submission, with a maximum of 8 worker requests for a single job. This is a restriction of our license. Jobs submitted to the queue will go into a 'pending' state until workers become available.

If the maximum number of queued jobs is reached routinely, we should consider buying more workers ;-)

Distributed Computing: interactive

Help Page at mathworks.com

The submission pathway below should only be used to test your ability to submit jobs and gain some experience with the submission process. <hi yellow>Interactive submissions like this lock the matlab session until jobs are either finished or fail</hi>. Do not use this approach for long running processes; use the batch submission described below.

  • Create a directory 'matlab' in your home directory.
  • Create a directory 'matlabjobs' in your home directory.
  • Create a text file 'myTestJob.m' in the 'matlab' directory with the contents below.
  • Change the location of 'DataLocation' in the appropriate line.
  • Start matlab as indicated inside the 'matlab' directory.
  • Issue the command 'myTestJob' at the matlab prompt.

If all goes well, 5 tasks are submitted within this single job to the engine. The engine invokes the cluster scheduler, which submits the job to the matlab queue and, when workers are available, fires off up to 8 workers to complete the job on the compute nodes. The <hi yellow>waitForState()</hi> call instructs the matlab session to wait until jobs have either finished or failed. Once finished, the results are gathered and displayed on screen (you should see 5 sets of random numbers in the matlab console).

% distributed matlab jobs
% start 'matlab -nodisplay', issue the command 'myTestJob'

% set up the scheduler and matlab worker environment
sched = findResource('scheduler', 'type', 'generic');
set(sched, 'HasSharedFilesystem', true);
set(sched, 'ClusterMatlabRoot', '/share/apps/matlab/2007a')
set(sched, 'SubmitFcn', @lsfSimpleSubmitFcn)   

% specify location for worker output
set(sched, 'DataLocation', '/home/hmeij/matlabjobs')

% create job and assign tasks to be done
j = createJob(sched);
T = createTask(j, @rand, 1, {{3,3} {3,3} {3,3} {3,3} {3,3}});

% submit job, wait for jobs to finish, fetch output
submit(j)
% WARNING: this may hang if the workers are busy!
waitForState(j)
results = getAllOutputArguments(j);
results{1:5}

No output is generated other than console output. However, the “workers” generate files in the 'DataLocation' area. For each job, you will find a *.mat file for input and output. The Job?.state.mat file contains the status of your job (queued, running, finished, failed). For each task in the job there is a log file which is a text file of the cluster scheduler activity. It'll detail information like what host the task was executed on and resources used.
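If you would rather not block in waitForState(), you can poll the job state yourself. A minimal sketch, assuming the job object j from the example above is still in your workspace (the state strings match those written to the Job?.state.mat file):

```matlab
% poll the job state instead of blocking in waitForState()
% (sketch; assumes the job object j from the example above)
while ~any(strcmp(get(j, 'State'), {'finished', 'failed'}))
    pause(10);   % check again every 10 seconds
end
results = getAllOutputArguments(j);
```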

Distributed Computing: batch

The idea is that you launch a matlab session, submit your job(s), and immediately exit the matlab session while your jobs run for days. In order to do this, you must turn your m-files into functions. This is fairly easy to do and is detailed below. So here are the steps:

  • Create a text file called 'myJob.m' inside the 'matlab' directory.
  • Transform your m-file to a function (see below).
  • Start matlab as indicated inside the 'matlab' directory.
  • Issue the command 'myJob' at the matlab prompt.
% distributed matlab jobs
% start 'matlab -nodisplay', issue the command 'myJob'

% set up the scheduler and matlab worker environment
sched = findResource('scheduler', 'type', 'generic');
set(sched, 'HasSharedFilesystem', true);
set(sched, 'ClusterMatlabRoot', '/share/apps/matlab/2007a')
set(sched, 'SubmitFcn', @lsfSimpleSubmitFcn)   

% specify location for worker output
set(sched, 'DataLocation', '/home/hmeij/matlabjobs')

% create job and assign tasks to be done
j = createJob(sched);
createTask(j, @tenfunction, 1, {'log.job1',1,2})
createTask(j, @tenfunction, 1, {'log.job2',3,4})
createTask(j, @tenfunction, 1, {'log.job3',5,6})

% submit job and gather scheduled info
submit(j)
get(sched)

% you can now exit matlab
% at system prompt type 'bjobs' 

To turn your m-file into a function, here are some tips. As depicted above, our function name is 'tenfunction' and we give it 3 input arguments (a string and 2 numbers). Edit the 'tenfunction.m' file and change the first line to

function [tempwins]=tenfunction(job,first,last)

In this example our function returns an array. Here is an example that returns a number.

function m=tenfunction(job,first,last)

At the very bottom of your m-file, add

end

The input arguments can now be used in your code, for example:

for jj=first:last,
...more code

Notice the string we feed the function. If we invoke the same program 3 times, as in this example, we'd like to have some information on what the program is doing while execution takes place. Here is an example that writes the progress of the program to a text file, which we can read to assess the task's progress:

...some code
  if (count>10)
     save(job,'template_start','-ASCII');
     count=0;
  end
...more code

The first input argument specifies the log file the task should write to, in this case 'log.job?' (mislabeled really, it should be log.task? …). These files are written to the 'matlab' directory and their contents can be viewed with a unix command like 'cat log.job?' on the console.

After you have submitted the jobs and exited matlab, you can use scheduler commands to track your tasks. Follow this link for more info about these commands. For example:

[hmeij@swallowtail matlab]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
5398    hmeij   RUN   matlab     swallowtail nfs-2-1     /share/apps/matlab/2007a/bin/worker Jul 19 09:05
5399    hmeij   RUN   matlab     swallowtail nfs-2-2     /share/apps/matlab/2007a/bin/worker Jul 19 09:05
5400    hmeij   RUN   matlab     swallowtail compute-2-32 /share/apps/matlab/2007a/bin/worker Jul 19 09:05

Your jobs will also be listed by the Cluster Monitor tool and the Cluster Top tool.

STDOUT & STDERR

Jesse reports the following procedure to get stdout from the matlab processes.

First, you have to enable saving of stdout. To do this, modify the script that creates the tasks and submits them. After all tasks have been created with createTask, but before the job is actually submitted using submit, insert the following lines:

tasks = get(job, 'Tasks');
set(tasks, 'CaptureCommandWindowOutput', true);

where job is the variable for the job that was created with createJob.

After the tasks have run, go to matlabjobs/JobX, where X is the number of the job. Then, start up matlab, and type:

load Task1.out.mat

(assuming Task1 is the task that you're interested in).

This loads several variables into matlab. The variable commandwindowoutput contains stdout. Information about errors that occurred is contained in the variable errormessage and a matlab structure called errorstruct.
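A minimal session to pull that output back out, assuming Task1 of Job1 per the directory layout described above:

```matlab
% inspect captured stdout and errors for one task
% (sketch; run from matlabjobs/Job1, variable names per the description above)
load Task1.out.mat
disp(commandwindowoutput)   % the task's stdout
disp(errormessage)          % empty if the task succeeded
```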

Parallel Computing "Interactive"

→ Sort of interactive; Matlab also has a 'pmode' utility for truly interactive code development: pmode Help Page at mathworks.com

Help page at mathworks.com

The submission of parallel jobs is identical to the distributed submission process with some minor changes. The most significant change is that only one task within one job is defined. That task is then duplicated amongst workers (mathworks calls these workers “labs”). The labs are able to communicate with each other and are 'aware' of each other. Each lab has a unique labindex number which identifies the lab. So, the steps for submissions are:

  • Find a scheduler.
  • Create a parallel job.
  • Define a task.
  • Submit the job.
  • Retrieve the results.

Assuming you have done the steps detailed in the distributed submission process described in this document, create two files in your '~/matlab' directory. Add the contents below. Here is a brief description of this example task.

In this example, the lab whose labindex value is 1 creates a magic square comprised of a number of rows and columns that is equal to the number of labs running the job (numlabs). In this case, four labs run a parallel job with a 4-by-4 magic square. The first lab broadcasts the matrix with labBroadcast to all the other labs, each of which calculates the sum of one column of the matrix. All of these column sums are combined with the gplus function to calculate the total sum of the elements of the original magic square.

Launch matlab and issue the command 'myTestPJob' to execute the contents of that file. It sets up the parallel environment, defines a file dependency amongst the labs, executes the function colsum while distributing the task across 4 labs, and gathers the results. The scheduler is set to 'local', meaning this runs on the head node, just to get a first taste of submitting parallel jobs. (This also implies the waitForState() call will not hang if all workers are busy …)

file: colsum.m

function total_sum = colsum
if labindex == 1
    % Send magic square to other labs
    A = labBroadcast(1,magic(numlabs)) 
else
    % Receive broadcast on other labs
    A = labBroadcast(1) 
end

% Calculate sum of column identified by labindex for this lab
column_sum = sum(A(:,labindex))

% Calculate total sum by combining column sum from all labs
total_sum = gplus(column_sum)

file: myTestPJob.m

% parallel matlab jobs
% start 'matlab -nodisplay', issue the command 'myTestPJob'

% set up a local scheduler and matlab worker environment
sched = findResource('scheduler', 'configuration', 'local')
set(sched, 'Configuration', 'local')

set(sched, 'HasSharedFilesystem', true);
set(sched, 'ClusterMatlabRoot', '/share/apps/matlab/2007a')

% specify location for worker output
set(sched, 'DataLocation', '/home/hmeij/matlabjobs')

% create job and assign tasks to be done
pj = createParallelJob(sched);
set(pj, 'FileDependencies', {'colsum.m'})

set(pj, 'MaximumNumberOfWorkers', 4)
set(pj, 'MinimumNumberOfWorkers', 4)

t = createTask(pj, @colsum, 1, {})

% submit job, wait to finish, get status jobs, gather results
submit(pj)
waitForState(pj)
get(pj,'Tasks')
results = getAllOutputArguments(pj)

Parallel Computing batch

  • Ok, submitting in a more batch-oriented mode looks like this.
  • As usual, change the 'DataLocation' line.

file: myPJob.m

% parallel matlab jobs
% start 'matlab -nodisplay', issue the command 'myPJob'

% set up the scheduler and matlab worker environment
sched = findResource('scheduler', 'configuration', 'generic')
set(sched, 'Configuration', 'generic')

set(sched, 'HasSharedFilesystem', true);
set(sched, 'ClusterMatlabRoot', '/share/apps/matlab/2007a')
set(sched, 'ParallelSubmitFcn', @lsfParallelSubmitFcn)   

% specify location for worker output
set(sched, 'DataLocation', '/home/hmeij/matlabjobs')

% create job and assign tasks to be done
pj = createParallelJob(sched);
set(pj, 'FileDependencies', {'colsum.m'})

set(pj, 'MaximumNumberOfWorkers', 4)
set(pj, 'MinimumNumberOfWorkers', 4)

t = createTask(pj, @colsum, 1, {})

% submit job and exit matlab
submit(pj)
get(pj,'Tasks')

% you can now exit matlab
% at system prompt type 'bjobs' 
  • Once submitted, the 'myPJob' invocation will report the submission
  • It also reports which file contains the job output
  • 'bjobs' will also report the job submission at the shell prompt
>> myPJob

...

Job output will be written to: /home/hmeij/matlabjobs/Job1.mpiexec.out
BSUB output: Job <5762> is submitted to queue <matlab>.


ans =

    Tasks: 4 by 1
    =============

 Task ID   State    End Time           Function Name  Error
 ----------------------------------------------------------
       1   pending                           @colsum
       2   pending                           @colsum
       3   pending                           @colsum
       4   pending                           @colsum

>> exit


[hmeij@swallowtail matlabjobs]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
5762    hmeij   RUN   matlab     swallowtail compute-1-9:compute-1-9:compute-1-9:compute-1-9 Job1       Jul 23 11:11
  • The contents of the output show that the workers start SMPD processes. These are the “daemons” that perform the message passing amongst the workers. At the current time we're using the MathWorks MPI (Message Passing Interface) that ships with this release; it is invoked with the command 'mw_mpiexec'.
  • Once the workers have started, each “lab” now has a unique “labindex”, in our case [0],[1],[2] and [3] … you can trace the communications back and forth between the “labs” by reading the log.
...
Starting SMPD on  compute-1-9 ...
ssh compute-1-9 "/share/apps/matlab/2007a/bin/mw_smpd" -s -phrase MATLAB -port 25762
All SMPDs launched
...
"/share/apps/matlab/2007a/bin/mw_mpiexec" -phrase MATLAB -port 25762 
  -l -hosts 1 compute-1-9 4 \\
  -genvlist MDCE_DECODE_FUNCTION,MDCE_STORAGE_LOCATION,
            MDCE_STORAGE_CONSTRUCTOR,MDCE_JOB_LOCATION 
  "/share/apps/matlab/2007a/bin/worker" -parallel
...



cluster/39.1246540693.txt.gz · Last modified: 2009/07/02 09:18 by hmeij