\\
**[[cluster:32|Back]]**

====== OpenMPI ENV ======

===== Tests =====
To test your environment, execute the following two binaries and compare the output. Everything should already be set up for you; if not, contact the HPCadmin.

**#1**
<code>
[hmeij@swallowtail ~]$ /share/apps/bin/hello.run
Running on ilogin1 and ilogin2 with -np=16
Hello, world, I am 0 of 16
Hello, world, I am 11 of 16
Hello, world, I am 1 of 16
Hello, world, I am 2 of 16
Hello, world, I am 3 of 16
Hello, world, I am 4 of 16
Hello, world, I am 5 of 16
Hello, world, I am 6 of 16
Hello, world, I am 7 of 16
Hello, world, I am 8 of 16
Hello, world, I am 9 of 16
Hello, world, I am 10 of 16
Hello, world, I am 12 of 16
Hello, world, I am 13 of 16
Hello, world, I am 14 of 16
Hello, world, I am 15 of 16
</code>

**#2**
<code>
[hmeij@swallowtail ~]$ /share/apps/bin/cpi.run
Running on ilogin1 and ilogin2 with -np=16
Process 10 on compute-1-16.local
Process 0 on compute-1-15.local
Process 2 on compute-1-15.local
Process 3 on compute-1-15.local
Process 4 on compute-1-15.local
Process 5 on compute-1-15.local
Process 6 on compute-1-15.local
Process 7 on compute-1-15.local
Process 1 on compute-1-15.local
pi is approximately 3.1416009869231245, Error is 0.0000083333333314
wall clock time = 0.166646
Process 8 on compute-1-16.local
Process 9 on compute-1-16.local
Process 11 on compute-1-16.local
Process 12 on compute-1-16.local
Process 13 on compute-1-16.local
Process 14 on compute-1-16.local
Process 15 on compute-1-16.local
</code>

Done. For those who are interested, below is the what and where of OpenMPI on our cluster.

===== OpenMPI =====

install directory: ''/share/apps/openmpi-1.2''

You can add the ''bin/'' subdirectory to your path if you want. This is not strictly necessary as long as you provide the full path to the binaries in your scripts.
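
If you do want the OpenMPI tools on your path, a minimal sketch (assuming a bash shell; adjust for your own shell and startup file):

<code>
# Optional: put the OpenMPI 1.2 binaries on your PATH (bash syntax).
# Add this line to ~/.bashrc if you want it in every session.
export PATH=/share/apps/openmpi-1.2/bin:$PATH

# Verify which mpirun will be picked up:
which mpirun
</code>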

The two scripts ''hello.run'' and ''cpi.run'' invoke the ''mpirun'' binary, like so:

<code>
#!/bin/bash

echo Running on ilogin1 and ilogin2 with -np=16

/share/apps/openmpi-1.2/bin/mpirun -np 16 \
  -machinefile /share/apps/openmpi-1.2/bin/hello.machines \
  /share/apps/openmpi-1.2/bin/hello
</code>
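
A machine file is just a list of hostnames, one per line, optionally with a slot count. The contents below are a hypothetical sketch of what ''hello.machines'' might look like; the actual file on the cluster may differ:

<code>
# hypothetical hello.machines: one host per line,
# "slots=" tells OpenMPI how many processes it may place on each host
ilogin1 slots=8
ilogin2 slots=8
</code>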

The two binaries have the OpenMPI libraries linked in, like so:

<code>
[hmeij@swallowtail ~]# ldd /share/apps/openmpi-1.2/bin/hello
        libmpi.so.0 => /share/apps/openmpi-1.2/lib/libmpi.so.0 (0x0000002a95557000)
        libopen-rte.so.0 => /share/apps/openmpi-1.2/lib/libopen-rte.so.0 (0x0000002a956eb000)
        libopen-pal.so.0 => /share/apps/openmpi-1.2/lib/libopen-pal.so.0 (0x0000002a95844000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003684000000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003686d00000)
        libutil.so.1 => /lib64/libutil.so.1 (0x0000003688600000)
        libm.so.6 => /lib64/tls/libm.so.6 (0x0000003683e00000)
        libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000003684400000)
        libc.so.6 => /lib64/tls/libc.so.6 (0x0000003683b00000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003683900000)
</code>

===== Compiling =====

To compile, for example, C code for OpenMPI:

<code>
/share/apps/openmpi-1.2/bin/mpicc -o ./mpi /share/apps/openmpi-1.2/bin/cpi.c
</code>

Check that the created binary finds all the libraries with ''ldd'' (see the output above).
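
If ''ldd'' reports any OpenMPI library as "not found", a likely fix is to point the runtime linker at the OpenMPI library directory. A minimal sketch, assuming a bash shell:

<code>
# Only needed if ldd shows "not found" for libmpi.so.0 and friends
export LD_LIBRARY_PATH=/share/apps/openmpi-1.2/lib:$LD_LIBRARY_PATH

# Re-check the freshly compiled binary
ldd ./mpi
</code>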

===== The Problem =====

Once you have your binary compiled, you can execute it on the head node or any other node as described above. But the ''hello.run'' and ''cpi.run'' scripts point ''mpirun'' to a hardcoded "machines" file.

This will not work when submitting your program via ''bsub''. Platform reports:

<hi yellow>
As I mentioned, Lava is not natively capable of running parallel jobs, so you will have to write your own integration script to parse the hosts allocated by LSF (via the LSB_HOSTS variable) and pass them to your MPI distribution.
</hi>

<hi orange>
Also, a reminder: because of the lack of LSF's parallel support daemons, these scripts can only provide a loose integration with Lava. Specifically, Lava only knows about the mpirun process on the first host; it has no knowledge of the other parallel processes on the other hosts involved in a parallel job. So if, in some circumstances, a parallel job fails, Lava cannot clean up the leftover processes, for example, MPICH 1's shared-memory leftovers. You may want to regularly check your cluster for this issue.
</hi>
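
As an illustration only (not the actual integration detailed on the next page), here is a minimal sketch of such a wrapper: it turns the LSB_HOSTS list that LSF/Lava sets into a temporary machine file and hands it to ''mpirun''. The script name and the per-job binary are hypothetical.

<code>
#!/bin/bash
# hypothetical wrapper: submit with  bsub -n 16 ./mpirun_wrapper ./mpi

MACHINEFILE=$(mktemp /tmp/machines.XXXXXX)

# LSB_HOSTS holds one entry per allocated slot, e.g.
# "compute-1-15 compute-1-15 compute-1-16 ..."; write one hostname per line
for host in $LSB_HOSTS; do
    echo $host >> $MACHINEFILE
done

NP=$(wc -l < $MACHINEFILE)

/share/apps/openmpi-1.2/bin/mpirun -np $NP \
  -machinefile $MACHINEFILE "$@"

rm -f $MACHINEFILE
</code>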

And this makes the job submission process for parallel jobs tedious.

So click on **Back** and we'll detail that.

\\
**[[cluster:32|Back]]**