In Summer 2017 we converted our Wesleyan Matlab license to a campus-wide Total Academic Headcount license. This means there are no longer any license restrictions, so you can run as many Matlab jobs as you wish using the matlab2017b binary. At this time I see no need to license the Distributed Computing Engine in R2017b.
We will leave the default matlab binary pointing to matlab2015a, which carries the 16-worker limit of the Distributed Computing Engine.
— Henk 2017/09/29 09:14
This Matlab installation contains the following toolboxes:
Only a single-user license for each, for now … the Engine can invoke up to 8 “workers”, which can run simultaneously without checking out licenses.
This installation supports distributed and parallel computing across 8 “workers” using the Distributed Computing Engine. MPIexec support is built into the Engine. Please consult the web site for details.
Matlab on our head node swallowtail should only be used for code development and job submission. Once your code runs without errors or your job has been submitted, exit Matlab to release the license. The Engine submits your job to the cluster scheduler, which in turn finds an idle host to process the job.
There are two types of jobs users can submit: distributed jobs and parallel jobs. Details regarding the steps for each are listed below. This information is gleaned from the Mathworks web site, which contains much more detail.
Help page for network administration regarding Distributed Computing Engine
The matlab queue only accepts 8 jobs for submission, with a maximum of 8 worker requests for a single job. This is a restriction of our license. Jobs submitted to the queue will go into a 'pending' state until workers become available.
If the queue regularly fills to its maximum number of queued jobs, it is time to consider buying more workers.
The submission pathway below should only be used to test your ability to submit jobs and to gain some experience with the submission process. <hi yellow>Interactive submissions like this lock the matlab session until jobs either finish or fail</hi>. Do not use this approach for long-running processes; use the batch submission described below.
If all goes well, 5 tasks are submitted within this single job to the Engine, which hands the job to the cluster scheduler's matlab queue; when appropriate, the scheduler fires off up to 8 workers to complete the job on the compute nodes. The <hi yellow>waitForState()</hi> call instructs the matlab session to wait until jobs are either finished or failed. Once finished, the results are gathered and displayed on screen (you should see 5 sets of random numbers in the matlab console).
% distributed matlab jobs
% start 'matlab -nodisplay', issue the command 'myTestJob'

% set up the scheduler and matlab worker environment
sched = findResource('scheduler', 'type', 'generic');
set(sched, 'HasSharedFilesystem', true);
set(sched, 'ClusterMatlabRoot', '/share/apps/matlab/2008a')
set(sched, 'SubmitFcn', @lsfSimpleSubmitFcn)

% specify location for worker output
set(sched, 'DataLocation', '/home/hmeij/matlabjobs')

% create job and assign tasks to be done
j = createJob(sched);
T = createTask(j, @rand, 1, {{3,3} {3,3} {3,3} {3,3} {3,3}});

% submit job, wait for jobs to finish, fetch output
submit(j)
% WARNING: this may hang if the workers are busy!
waitForState(j)
results = getAllOutputArguments(j);
results{1:5}
No output is generated other than console output. However, the “workers” generate files in the 'DataLocation' area. For each job, you will find a *.mat file for input and output. The Job?.state.mat file contains the status of your job (queued, running, finished, failed). For each task in the job there is a log file, a text file of the cluster scheduler activity; it details information such as which host the task executed on and the resources used.
The idea is that you launch a matlab session, submit your job(s), and immediately exit the matlab session while your jobs run for days. In order to do this, you must turn your m-files into functions. This is fairly easy to do and is detailed below. So here are the steps:
% distributed matlab jobs
% start 'matlab -nodisplay', issue the command 'myJob'

% set up the scheduler and matlab worker environment
sched = findResource('scheduler', 'type', 'generic');
set(sched, 'HasSharedFilesystem', true);
set(sched, 'ClusterMatlabRoot', '/share/apps/matlab/2008a')
set(sched, 'SubmitFcn', @lsfSimpleSubmitFcn)

% specify location for worker output
set(sched, 'DataLocation', '/home/hmeij/matlabjobs')

% create job and assign tasks to be done
j = createJob(sched);
createTask(j, @tenfunction, 1, {'log.job1',1,2})
createTask(j, @tenfunction, 1, {'log.job2',3,4})
createTask(j, @tenfunction, 1, {'log.job3',5,6})

% submit job and gather scheduler info
submit(j)
get(sched)
% you can now exit matlab
% at system prompt type 'bjobs'
To turn your m-file into a function, here are some tips. As depicted above, our function name is 'tenfunction' and we give it 3 input arguments (a string and 2 numbers). Edit the 'tenfunction.m' file and change the first line to
function [tempwins]=tenfunction(job,first,last)
In this example our function returns an array; to return a single number instead:
function m=tenfunction(job,first,last)
At the very bottom of your m-file, add
end
The input arguments can now be used in your code, for example:
for jj=first:last, ...more code
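Putting the pieces above together, a complete 'tenfunction.m' might look like the sketch below. The loop body and the accumulated result are placeholders; substitute your own computation.

function [tempwins] = tenfunction(job, first, last)
% job   - name of the log file this task writes progress to (a string)
% first - first iteration for this task
% last  - last iteration for this task
tempwins = [];
for jj = first:last
    % ... your computation here, accumulating results in tempwins ...
    tempwins(end+1) = jj;   % placeholder only
end
end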
Notice the string we feed the function. If we invoke the same program 3 times, as in this example, we'd like to have some information on what the program is doing while it executes. Here is an example that writes the program's progress to a text file, which we can read to assess the task's progress:
...some code
if (count>10)
  save(job,'template_start','-ASCII');
  count=0;
end
...more code
The first input argument specifies the log file the task should write to, in this case 'log.job?' (mislabeled really, it should be log.task? …). These files are written to the 'matlab' directory and their contents can be viewed with a unix command like 'cat log.job?' on the console.
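You can also read these log files from within matlab itself, without leaving your session. A minimal sketch, assuming the file name 'log.job1' from the example above exists in your current directory:

% display the contents of a task's progress log at the matlab prompt
type log.job1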
After you have submitted the jobs and exited matlab, you can use scheduler commands to track your tasks. Follow this link for more info about these commands. For example:
[hmeij@swallowtail matlab]$ bjobs
JOBID USER  STAT QUEUE  FROM_HOST   EXEC_HOST    JOB_NAME                            SUBMIT_TIME
5398  hmeij RUN  matlab swallowtail nfs-2-1      /share/apps/matlab/2007a/bin/worker Jul 19 09:05
5399  hmeij RUN  matlab swallowtail nfs-2-2      /share/apps/matlab/2007a/bin/worker Jul 19 09:05
5400  hmeij RUN  matlab swallowtail compute-2-32 /share/apps/matlab/2007a/bin/worker Jul 19 09:05
Your jobs will also be listed by the Cluster Monitor tool and the Cluster Top tool.
Jesse reports the following procedure to get stdout from the matlab processes.
First, you have to enable saving of stdout. To do this, modify the script that creates the tasks and submits them. After all tasks have been created with createTask, but before the job is actually submitted with submit, insert the following lines:
tasks = get(job, 'Tasks');
set(tasks, 'CaptureCommandWindowOutput', true);
where job is the variable for the job that was created with createJob.
After the tasks have run, go to matlabjobs/JobX, where X is the number of the job. Then, start up matlab, and type:
load Task1.out.mat
(assuming Task1 is the task that you're interested in).
This loads several variables into matlab. The variable commandwindowoutput contains stdout. Information about errors that occurred is contained in the variable errormessage and in a matlab structure called errorstruct.
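Putting these steps together, a minimal sketch for inspecting one task's output (variable and file names taken from the example above; 'Job1' assumed):

% from within matlabjobs/Job1, after the job has run
load Task1.out.mat            % loads commandwindowoutput, errormessage, errorstruct, ...
disp(commandwindowoutput)     % the task's stdout
if ~isempty(errormessage)
    disp(errormessage)        % error text, if the task failed
    disp(errorstruct)         % the matlab error structure
end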
→ The above is only sort of interactive; Matlab also has a 'pmode' utility for truly interactive code development: pmode Help Page at mathworks.com
The submission of parallel jobs is identical to the distributed submission process with some minor changes. The most significant change is that only one task within one job is defined. That task is then duplicated amongst the workers (mathworks calls these workers “labs”). The labs are able to communicate with each other and are 'aware' of each other. Each lab has a unique labindex number which identifies it. So, the steps for submission are:
Assuming you have done the steps detailed in the distributed submission process described in this document, create two files in your '~/matlab' directory. Add the contents below. Here is a brief description of this example task.
In this example, the lab whose labindex value is 1 creates a magic square comprised of a number of rows and columns that is equal to the number of labs running the job (numlabs). In this case, four labs run a parallel job with a 4-by-4 magic square. The first lab broadcasts the matrix with labBroadcast to all the other labs, each of which calculates the sum of one column of the matrix. All of these column sums are combined with the gplus function to calculate the total sum of the elements of the original magic square.
Launch matlab and issue the command 'myTestPJob' to execute the contents of that file. It sets up the parallel environment, defines a file dependency amongst the labs, executes the function colsum while distributing the task across 4 labs, and gathers the results. The scheduler is set to 'local', meaning the job runs on the head node, just to get a first taste of submitting parallel jobs. (This also implies the waitForState() call will not hang if all workers are busy …)
file: colsum.m
function total_sum = colsum
if labindex == 1
    % Send magic square to other labs
    A = labBroadcast(1, magic(numlabs))
else
    % Receive broadcast on other labs
    A = labBroadcast(1)
end
% Calculate sum of column identified by labindex for this lab
column_sum = sum(A(:, labindex))
% Calculate total sum by combining column sums from all labs
total_sum = gplus(column_sum)
file: myTestPJob.m
% parallel matlab jobs
% start 'matlab -nodisplay', issue the command 'myTestPJob'

% set up a local scheduler and matlab worker environment
sched = findResource('scheduler', 'configuration', 'local')
set(sched, 'Configuration', 'local')
set(sched, 'HasSharedFilesystem', true);
set(sched, 'ClusterMatlabRoot', '/share/apps/matlab/2007a')

% specify location for worker output
set(sched, 'DataLocation', '/home/hmeij/matlabjobs')

% create job and assign tasks to be done
pj = createParallelJob(sched);
set(pj, 'FileDependencies', {'colsum.m'})
set(pj, 'MaximumNumberOfWorkers', 4)
set(pj, 'MinimumNumberOfWorkers', 4)
t = createTask(pj, @colsum, 1, {})

% submit job, wait to finish, get status of tasks, gather results
submit(pj)
waitForState(pj)
get(pj,'Tasks')
results = getAllOutputArguments(pj)
file: myPJob.m
% parallel matlab jobs
% start 'matlab -nodisplay', issue the command 'myPJob'

% set up the scheduler and matlab worker environment
sched = findResource('scheduler', 'configuration', 'generic')
set(sched, 'Configuration', 'generic')
set(sched, 'HasSharedFilesystem', true);
set(sched, 'ClusterMatlabRoot', '/share/apps/matlab/2007a')
set(sched, 'ParallelSubmitFcn', @lsfParallelSubmitFcn)

% specify location for worker output
set(sched, 'DataLocation', '/home/hmeij/matlabjobs')

% create job and assign tasks to be done
pj = createParallelJob(sched);
set(pj, 'FileDependencies', {'colsum.m'})
set(pj, 'MaximumNumberOfWorkers', 4)
set(pj, 'MinimumNumberOfWorkers', 4)
t = createTask(pj, @colsum, 1, {})

% submit job and exit matlab
submit(pj)
get(pj,'Tasks')
% you can now exit matlab
% at system prompt type 'bjobs'
>> myPJob
...
Job output will be written to: /home/hmeij/matlabjobs/Job1.mpiexec.out
BSUB output: Job <5762> is submitted to queue <matlab>.

ans =

Tasks: 4 by 1
=============

Task ID    State     End Time   Function Name    Error
----------------------------------------------------------
      1    pending              @colsum
      2    pending              @colsum
      3    pending              @colsum
      4    pending              @colsum

>> exit

[hmeij@swallowtail matlabjobs]$ bjobs
JOBID USER  STAT QUEUE  FROM_HOST   EXEC_HOST                                       JOB_NAME SUBMIT_TIME
5762  hmeij RUN  matlab swallowtail compute-1-9:compute-1-9:compute-1-9:compute-1-9 Job1     Jul 23 11:11
...
Starting SMPD on compute-1-9 ...
ssh compute-1-9 "/share/apps/matlab/2007a/bin/mw_smpd" -s -phrase MATLAB -port 25762
All SMPDs launched
...
"/share/apps/matlab/2007a/bin/mw_mpiexec" -phrase MATLAB -port 25762 -l -hosts 1 compute-1-9 4 \
  -genvlist MDCE_DECODE_FUNCTION,MDCE_STORAGE_LOCATION,MDCE_STORAGE_CONSTRUCTOR,MDCE_JOB_LOCATION \
  "/share/apps/matlab/2007a/bin/worker" -parallel
...
WE ARE CURRENTLY NOT RUNNING 2009A … IT APPEARS THAT THE 2009A DC VERSION OF MATLAB STILL HAS 2008A UNDER THE HOOD WITH DC v3.3, SO WE FELL BACK TO 2008A. THE INSTRUCTIONS BELOW ARE FOR WHEN WE UPGRADE IN THE FUTURE.
<hi #ffff00>In the new version of Matlab DC 2009, the integration with LSF is complete, so always point to 'lsf' as the scheduler. Also of note: you should point to the queue matlab, which spans all hosts.</hi> So:
Hints on running Matlab jobs using version 2008a. This page was written when we ran version 2007a under the scheduler Lava, using generic integration scripts to connect Matlab to the scheduler. We have since upgraded to the scheduler LSF. We will still use the generic integration for distributed jobs; for parallel jobs, however, we will use the lsf integration scripts provided by MathWorks. So everything in this page still applies, but with these changes:
⇒ change the location where matlab is installed
set(sched, 'ClusterMatlabRoot', '/share/apps/matlab/2008a')
⇒ comment out or delete the lines referring to the generic functions
%2007a sched = findResource('scheduler', 'configuration', 'generic')
%2007a set(sched, 'Configuration', 'generic')
...
%2007a set(sched, 'ParallelSubmitFcn', @lsfParallelSubmitFcn)
⇒ point to the LSF scheduler and add the following 'bsub'-style parameters
sched = findResource('scheduler', 'type', 'LSF')
% note: a second set() call replaces the property value,
% so combine all bsub-style arguments into a single call
set(sched, 'SubmitArguments', '-R type==LINUX64 -q matlab')
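Putting the changes above together with the earlier myPJob.m example, a parallel submission under 2008a with the LSF integration might look like the sketch below. This is untested; paths and worker counts are taken from the examples above.

% parallel matlab job under 2008a using the LSF integration (sketch)
sched = findResource('scheduler', 'type', 'LSF')
set(sched, 'ClusterMatlabRoot', '/share/apps/matlab/2008a')
set(sched, 'DataLocation', '/home/hmeij/matlabjobs')
set(sched, 'SubmitArguments', '-R type==LINUX64 -q matlab')

% create the parallel job exactly as before
pj = createParallelJob(sched);
set(pj, 'FileDependencies', {'colsum.m'})
set(pj, 'MaximumNumberOfWorkers', 4)
set(pj, 'MinimumNumberOfWorkers', 4)
t = createTask(pj, @colsum, 1, {})
submit(pj)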