cluster:39

In summer 2017 we converted our Wesleyan Matlab license to a campus-wide `Total Academic Headcount` license. This means there are no more license restrictions, so you can run as many Matlab jobs as you wish using the **matlab2017b** binary. At this time I see no need to license the Distributed Computation Engine in R2017b.

We will leave the default **matlab** binary pointing to **matlab2015a**, which carries the 16-worker limit of the Distributed Computation Engine.

— *Henk 2017/09/29 09:14*

This Matlab installation contains the following toolboxes:

- Data Acquisition Toolbox 2.10
- Distributed Computing Toolbox 3.1
- Image Processing Toolbox 5.4
- Optimization Toolbox 3.1.1
- Filter Design Toolbox 4.1
- Signal Processing Toolbox 6.7
- Symbolic Math Toolbox 3.2
- Statistics Toolbox 6.0
- Wavelet Toolbox 4.0

There is only a single-user license for each toolbox, for now. However, the Engine can invoke up to 8 “workers”, which can run simultaneously without checking out licenses.

This installation supports distributed and parallel computing across 8 “workers” using the Distributed Computing Engine. MPIexec support is built into the Engine. Please consult the web site for details.

The matlab invocation and usage on our head node swallowtail should **only** be used for code development and job submissions. Once your code is running without errors or your program is submitted, exit Matlab to release the license. The Engine submits your job to the cluster scheduler which in turn finds an idle host to process the job.

There are two types of jobs users can submit: distributed jobs and parallel jobs. Details regarding the steps to take for either are listed below. This information is gleaned from the MathWorks web site, which contains much more detail.

**Help page for network administration regarding Distributed Computing Engine**

The matlab queue only accepts 8 jobs for submission, with a maximum of 8 worker requests for a single job. This is a restriction of our license. Jobs submitted to the queue will go into a 'pending' state until workers become available.

the maximum number of queued jobs can reach, at which time we should have considered buying more workers

The submission pathway below should only be used to test your ability to submit jobs and gain some experience with the submission process. <hi yellow>Interactive submissions like this lock the matlab session until jobs are either finished or fail</hi>. Do not use this approach for long-running processes; use the **batch submission** described below instead.

- Create a directory 'matlab' in your home directory.
- Create a directory 'matlabjobs' in your home directory.
- Create a text file 'myTestJob.m' in the 'matlab' directory with the contents below.
- Change the location of 'DataLocation' in the appropriate line.
- Start matlab as indicated inside the 'matlab' directory.
- Issue the command 'myTestJob' at the matlab prompt.

If all goes well, 5 **tasks** are submitted within this single **job** to the **engine** which invokes the **cluster scheduler** which submits the job to the **matlab queue** which, when appropriate, fires off up to 8 **workers** to complete the job on the compute nodes. The <hi yellow>waitForState()</hi> call instructs the matlab session to wait until jobs are either finished or failed. Once finished, the results are gathered and displayed on screen (you should see 5 sets of random numbers displayed in the matlab console).

    % distributed matlab jobs
    % start 'matlab -nodisplay', issue the command 'myTestJob'

    % set up the scheduler and matlab worker environment
    sched = findResource('scheduler', 'type', 'generic');
    set(sched, 'HasSharedFilesystem', true);
    set(sched, 'ClusterMatlabRoot', '/share/apps/matlab/2008a')
    set(sched, 'SubmitFcn', @lsfSimpleSubmitFcn)

    % specify location for worker output
    set(sched, 'DataLocation', '/home/hmeij/matlabjobs')

    % create job and assign tasks to be done
    j = createJob(sched);
    T = createTask(j, @rand, 1, {{3,3} {3,3} {3,3} {3,3} {3,3}});

    % submit job, wait for jobs to finish, fetch output
    submit(j)
    % WARNING: this may hang if the workers are busy!
    waitForState(j)
    results = getAllOutputArguments(j);
    results{1:5}

No output is generated other than console output. However, the “workers” generate files in the 'DataLocation' area. For each job, you will find a *.mat file for input and output. The Job?.state.mat file contains the status of your job (queued, running, finished, failed). For each task in the job there is a log file which is a text file of the cluster scheduler activity. It'll detail information like what host the task was executed on and resources used.

The idea is that you launch a matlab session, submit your job(s) and immediately exit the matlab session while your jobs run for days. In order to do this, you must turn your m-files into functions. This is fairly easy to do and is detailed below. So here are the steps:

- Create a text file called 'myJob.m' inside the 'matlab' directory.
- Transform your m-file to a function (see below).
- Start matlab as indicated inside the 'matlab' directory.
- Issue the command 'myJob' at the matlab prompt.

    % distributed matlab jobs
    % start 'matlab -nodisplay', issue the command 'myJob'

    % set up the scheduler and matlab worker environment
    sched = findResource('scheduler', 'type', 'generic');
    set(sched, 'HasSharedFilesystem', true);
    set(sched, 'ClusterMatlabRoot', '/share/apps/matlab/2008a')
    set(sched, 'SubmitFcn', @lsfSimpleSubmitFcn)

    % specify location for worker output
    set(sched, 'DataLocation', '/home/hmeij/matlabjobs')

    % create job and assign tasks to be done
    j = createJob(sched);
    createTask(j, @tenfunction, 1, {'log.job1',1,2})
    createTask(j, @tenfunction, 1, {'log.job2',3,4})
    createTask(j, @tenfunction, 1, {'log.job3',5,6})

    % submit job and gather scheduled info
    submit(j)
    get(sched)

    % you can now exit matlab
    % at system prompt type 'bjobs'

To turn your m-file into a function, here are some tips. As depicted above, our function name is 'tenfunction' and we give it 3 input arguments (a string and 2 numbers). Edit the 'tenfunction.m' file and change the first line to

function [tempwins]=tenfunction(job,first,last)

In this example our function returns an array. Here is an example that returns a number.

function m=tenfunction(job,first,last)

At the very bottom of your m-file add

end

The input arguments can now be used in your code, for example:

for jj=first:last, ...more code

Notice the string we feed the function. If we invoke the same program 3 times, as in this example, we'd like to have some information on what the program is doing while execution takes place. Here is an example that writes the progress of the program to a text file, which we can read to assess the task progress:

    ...some code
    if (count>10)
        save(job,'template_start','-ASCII');
        count=0;
    end
    ...more code

The first input argument specifies the log file the task should write to, in this case 'log.job?' (mislabeled really, should be log.task? …). These files are written to the 'matlab' directory and their contents can be viewed with a unix command like 'cat log.job?' on the console.
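Putting the tips above together, a converted m-file might look like the minimal sketch below. The function signature, the loop bounds and the `save` call come from this page; the result array and the squared-index "work" are hypothetical placeholders for your own computation, and `template_start` is assumed here to hold the last index reached.

```matlab
function tempwins = tenfunction(job, first, last)
% TENFUNCTION  Sketch of an m-file converted to a function.
%   job         - name of the progress file, e.g. 'log.job1'
%   first, last - loop bounds handed in by createTask
tempwins = zeros(1, last - first + 1);    % hypothetical result array
count = 0;
for jj = first:last
    tempwins(jj - first + 1) = jj^2;      % placeholder for the real computation
    count = count + 1;
    if (count > 10)
        template_start = jj;              % assumed progress marker: last index reached
        save(job, 'template_start', '-ASCII');
        count = 0;
    end
end
end
```

With the small ranges used in the example tasks (1 to 2, 3 to 4, 5 to 6) the counter never exceeds 10, so no progress file is written; a longer range such as 1 to 25 would write it twice.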

After you have submitted the jobs and exited matlab, you can use scheduler commands to track your tasks. **Follow this link for more info about these commands**. For example:

    [hmeij@swallowtail matlab]$ bjobs
    JOBID USER  STAT QUEUE  FROM_HOST   EXEC_HOST    JOB_NAME                            SUBMIT_TIME
    5398  hmeij RUN  matlab swallowtail nfs-2-1      /share/apps/matlab/2007a/bin/worker Jul 19 09:05
    5399  hmeij RUN  matlab swallowtail nfs-2-2      /share/apps/matlab/2007a/bin/worker Jul 19 09:05
    5400  hmeij RUN  matlab swallowtail compute-2-32 /share/apps/matlab/2007a/bin/worker Jul 19 09:05

Your jobs will also be listed by the **Cluster Monitor** tool and the **Cluster Top** tool.

Jesse reports … to get stdout from the matlab processes.

First, you have to enable saving of stdout. To do this, modify the script that creates the tasks and submits them. After all tasks have been created with createTask, but before the job is actually submitted using submit, insert the following lines:

    tasks = get(job, 'Tasks');
    set(tasks, 'CaptureCommandWindowOutput', true);

where job is the variable for the job that was created with createJob.

After the tasks have run, go to matlabjobs/JobX, where X is the number of the job. Then, start up matlab, and type:

`load Task1.out.mat`

(assuming Task1 is the task that you're interested in).

This loads several variables into matlab. The variable `commandwindowoutput` contains stdout. Information about errors that occurred is contained in the variable `errormessage` and a matlab structure called `errorstruct`.
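As a small sketch of that inspection step, the lines below load the task output and print what was captured. They assume you started matlab inside the relevant matlabjobs/JobX directory, that Task1 is the task of interest, and that the file holds the variables named above.

```matlab
% run from inside matlabjobs/JobX
load Task1.out.mat

% show whatever the task printed to its command window
disp(commandwindowoutput)

% show the error text only if the task recorded one
if ~isempty(errormessage)
    disp(errormessage)
end
```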

→ Sort of interactive: Matlab also has a 'pmode' utility for truly interactive code development; see the pmode Help Page at mathworks.com.

The submission of parallel jobs is identical to the distributed submission process with some minor changes. The most significant change is that only **one task** within **one job** is defined. That task is then duplicated amongst workers (mathworks calls these workers *“labs”*). The labs are able to communicate with each other and are 'aware' of each other. Each lab has a unique *labindex* number which identifies the lab. So, the steps for submissions are:

- Find a scheduler.
- Create a parallel job.
- Define a task.
- Submit the job.
- Retrieve the results.

Assuming you have done the steps detailed in the distributed submission process described in this document, create two files in your '~/matlab' directory. Add the contents below. Here is a brief description of this example task.

In this example, the lab whose labindex value is 1 creates a magic square comprised of a number of rows and columns that is equal to the number of labs running the job (numlabs). In this case, four labs run a parallel job with a 4-by-4 magic square. The first lab broadcasts the matrix with labBroadcast to all the other labs, each of which calculates the sum of one column of the matrix. All of these column sums are combined with the gplus function to calculate the total sum of the elements of the original magic square.
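To know what result to expect, the same arithmetic can be checked serially without any workers. This is only a stand-in sketch: the loop variable `lab` plays the role of labindex and a plain `sum` replaces gplus, so it does not exercise the parallel machinery at all.

```matlab
n = 4;                          % number of labs in the example
A = magic(n);                   % the matrix lab 1 would broadcast

column_sum = zeros(1, n);
for lab = 1:n                   % 'lab' stands in for labindex
    column_sum(lab) = sum(A(:, lab));
end

total_sum = sum(column_sum);    % serial stand-in for gplus
disp(column_sum)
disp(total_sum)
```

Every column of a 4-by-4 magic square sums to 34, so the total is 136; that is the value each lab should report from the parallel job.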

Launch matlab and issue the command 'myTestPJob' to execute the contents of that file. It sets up the parallel environment, defines a file dependency amongst the labs, executes function colsum while distributing the task across 4 labs and gathers the results. The scheduler is set to 'local', meaning the job runs on the head node, just to get a first taste of submitting parallel jobs. (This also implies the waitForState() call will not hang if all workers are busy …)

**file: colsum.m**

    function total_sum = colsum
    if labindex == 1
        % Send magic square to other labs
        A = labBroadcast(1,magic(numlabs))
    else
        % Receive broadcast on other labs
        A = labBroadcast(1)
    end
    % Calculate sum of column identified by labindex for this lab
    column_sum = sum(A(:,labindex))
    % Calculate total sum by combining column sum from all labs
    total_sum = gplus(column_sum)

**file: myTestPJob.m**

    % parallel matlab jobs
    % start 'matlab -nodisplay', issue the command 'myTestPJob'

    % set up a local scheduler and matlab worker environment
    sched = findResource('scheduler', 'configuration', 'local')
    set(sched, 'Configuration', 'local')
    set(sched, 'HasSharedFilesystem', true);
    set(sched, 'ClusterMatlabRoot', '/share/apps/matlab/2007a')

    % specify location for worker output
    set(sched, 'DataLocation', '/home/hmeij/matlabjobs')

    % create job and assign tasks to be done
    pj = createParallelJob(sched);
    set(pj, 'FileDependencies', {'colsum.m'})
    set(pj, 'MaximumNumberOfWorkers', 4)
    set(pj, 'MinimumNumberOfWorkers', 4)
    t = createTask(pj, @colsum, 1, {})

    % submit job, wait to finish, get status jobs, gather results
    submit(pj)
    waitForState(pj)
    get(pj,'Tasks')
    results = getAllOutputArguments(pj)

- If you need to pass more arguments to the scheduler, for example to direct a worker to a specific machine for memory requirements rather than a random pick, you can do so via the commands described at this External Link

- Ok, submitting in a more batch oriented mode looks like this.
- As usual, change the 'DataLocation' line.

**file: myPJob.m**

    % parallel matlab jobs
    % start 'matlab -nodisplay', issue the command 'myPJob'

    % set up the scheduler and matlab worker environment
    sched = findResource('scheduler', 'configuration', 'generic')
    set(sched, 'Configuration', 'generic')
    set(sched, 'HasSharedFilesystem', true);
    set(sched, 'ClusterMatlabRoot', '/share/apps/matlab/2007a')
    set(sched, 'ParallelSubmitFcn', @lsfParallelSubmitFcn)

    % specify location for worker output
    set(sched, 'DataLocation', '/home/hmeij/matlabjobs')

    % create job and assign tasks to be done
    pj = createParallelJob(sched);
    set(pj, 'FileDependencies', {'colsum.m'})
    set(pj, 'MaximumNumberOfWorkers', 4)
    set(pj, 'MinimumNumberOfWorkers', 4)
    t = createTask(pj, @colsum, 1, {})

    % submit job and exit matlab
    submit(pj)
    get(pj,'Tasks')

    % you can now exit matlab
    % at system prompt type 'bjobs'

- Once submitted, the 'myPJob' invocation will report the submission.
- It also reports which file contains the job output.
- 'bjobs' will also report the job submission at the shell prompt.

    >> myPJob
    ...
    Job output will be written to: /home/hmeij/matlabjobs/Job1.mpiexec.out
    BSUB output: Job <5762> is submitted to queue <matlab>.

    ans =

    Tasks: 4 by 1
    =============

    Task ID  State    End Time  Function Name  Error
    ----------------------------------------------------------
    1        pending            @colsum
    2        pending            @colsum
    3        pending            @colsum
    4        pending            @colsum

    >> exit

    [hmeij@swallowtail matlabjobs]$ bjobs
    JOBID USER  STAT QUEUE  FROM_HOST   EXEC_HOST                                       JOB_NAME SUBMIT_TIME
    5762  hmeij RUN  matlab swallowtail compute-1-9:compute-1-9:compute-1-9:compute-1-9 Job1     Jul 23 11:11

- The contents of the output shows that the workers start SMPD processes. These are the “daemons” that will perform the message passing amongst the workers. At the current time we're using the MathWorks MPI (Message Passing Interface) that ships with this release and is invoked with the command 'mw_mpiexec'.

- Once the workers have started, each *“lab”* now has a unique *“labindex”*, in our case [0],[1],[2] and [3] … you can trace the communications back and forth between the *“labs”* by reading the log.

    ...
    Starting SMPD on compute-1-9 ...
    ssh compute-1-9 "/share/apps/matlab/2007a/bin/mw_smpd" -s -phrase MATLAB -port 25762
    All SMPDs launched
    ...
    "/share/apps/matlab/2007a/bin/mw_mpiexec" -phrase MATLAB -port 25762 -l -hosts 1 compute-1-9 4 \\
    -genvlist MDCE_DECODE_FUNCTION,MDCE_STORAGE_LOCATION,
    MDCE_STORAGE_CONSTRUCTOR,MDCE_JOB_LOCATION "/share/apps/matlab/2007a/bin/worker" -parallel
    ...

WE ARE CURRENTLY NOT RUNNING 2009A … IT APPEARS THAT THE 2009A DC VERSION OF MATLAB STILL HAS 2008A UNDER THE HOOD WITH DC v3.3, SO WE FELL BACK TO 2008A. THE INSTRUCTIONS BELOW ARE FOR WHEN WE UPGRADE IN THE FUTURE.

<hi #ffff00>
In the new version of Matlab DC 2009, the integration with LSF is complete, so **always** point to 'lsf' as the scheduler. Also of note is that you should point to the queue matlab, which spans all hosts.</hi> So:

- In the line *sched = findResource('scheduler', 'type', 'generic');* replace 'generic' with 'lsf'.

- Comment out the line that contains *set(sched, 'SubmitFcn', @lsfSimpleSubmitFcn)*

- And point to the queue with *set(sched, 'SubmitArguments', '-q matlab');*
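Applied to the distributed batch script from earlier on this page, those three changes might look like the sketch below. It reuses only calls that already appear on this page and has not been run against a 2009 installation. The `ClusterMatlabRoot` path is carried over unchanged as an assumption, since the install location of the newer release is not given here.

```matlab
% distributed matlab jobs, LSF integration (untested sketch)
sched = findResource('scheduler', 'type', 'lsf');
set(sched, 'HasSharedFilesystem', true);

% assumption: path copied from the 2008a example; adjust to the installed release
set(sched, 'ClusterMatlabRoot', '/share/apps/matlab/2008a')

% commented out, as instructed for the lsf scheduler type:
% set(sched, 'SubmitFcn', @lsfSimpleSubmitFcn)

% point to the matlab queue, which spans all hosts
set(sched, 'SubmitArguments', '-q matlab');

% specify location for worker output
set(sched, 'DataLocation', '/home/hmeij/matlabjobs')

% create job, assign a task, submit
j = createJob(sched);
createTask(j, @tenfunction, 1, {'log.job1',1,2})
submit(j)
```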

Hints on running Matlab jobs using version 2008a. This page was written when we ran version 2007a under the scheduler Lava. At that time we used generic integration scripts to connect Matlab to the scheduler. We have since upgraded to the scheduler LSF. We will still be using the generic integration for distributed jobs. However, for parallel jobs, we will be using the lsf integration scripts provided by MathWorks. So everything on this page still applies, but with these changes:

- Distributed and Parallel Jobs

⇒ change the location where matlab is installed

set(sched, 'ClusterMatlabRoot', '/share/apps/matlab/2008a')

- Parallel Jobs

⇒ comment out or delete the lines referring to the generic functions

    %2007a sched = findResource('scheduler', 'configuration', 'generic')
    %2007a set(sched, 'Configuration', 'generic')
    ...
    %2007a set(sched, 'ParallelSubmitFcn', @lsfParallelSubmitFcn)

⇒ point to the LSF scheduler and add the following 'bsub'-style parameters

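    sched = findResource('scheduler','type','LSF')
    % note: each set() call replaces the previous 'SubmitArguments' value,
    % so combine the options in one string if you need both
    set(sched, 'SubmitArguments', '-R type==LINUX64')
    set(sched, 'SubmitArguments', '-q matlab')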

cluster/39.txt · Last modified: 2017/09/29 09:19 by hmeij07