\\
**[[cluster:32|Back]]**

====== OpenMPI ENV ======


===== Tests =====

To test your environment, execute the following two binaries and compare the output. It should all be set up for you already. If not, contact the HPCadmin.

**#1**
<code>
[hmeij@swallowtail ~]$ /share/apps/bin/hello.run
Running on ilogin1 and ilogin2 with -np=16
Hello, world, I am 0 of 16
Hello, world, I am 11 of 16
Hello, world, I am 1 of 16
Hello, world, I am 2 of 16
Hello, world, I am 3 of 16
Hello, world, I am 4 of 16
Hello, world, I am 5 of 16
Hello, world, I am 6 of 16
Hello, world, I am 7 of 16
Hello, world, I am 8 of 16
Hello, world, I am 9 of 16
Hello, world, I am 10 of 16
Hello, world, I am 12 of 16
Hello, world, I am 13 of 16
Hello, world, I am 14 of 16
Hello, world, I am 15 of 16
</code>

**#2**
<code>
[hmeij@swallowtail ~]$ /share/apps/bin/cpi.run
Running on ilogin1 and ilogin2 with -np=16
Process 10 on compute-1-16.local
Process 0 on compute-1-15.local
Process 2 on compute-1-15.local
Process 3 on compute-1-15.local
Process 4 on compute-1-15.local
Process 5 on compute-1-15.local
Process 6 on compute-1-15.local
Process 7 on compute-1-15.local
Process 1 on compute-1-15.local
pi is approximately 3.1416009869231245, Error is 0.0000083333333314
wall clock time = 0.166646
Process 8 on compute-1-16.local
Process 9 on compute-1-16.local
Process 11 on compute-1-16.local
Process 12 on compute-1-16.local
Process 13 on compute-1-16.local
Process 14 on compute-1-16.local
Process 15 on compute-1-16.local
</code>

Done. For those who are interested, below is the what & where of OpenMPI on our cluster.




===== OpenMPI =====

Install directory: ''/share/apps/openmpi-1.2''

You can add the ''bin/'' subdirectory to your path if you want. It is not really necessary as long as you provide the full path to the binaries in your scripts.
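
For example, a couple of lines like these in your ''~/.bashrc'' would do it (just a sketch; adjust to your own shell setup, and note that the ''LD_LIBRARY_PATH'' line is usually redundant since the binaries above already resolve their libraries):

<code>
# optional: put the OpenMPI binaries and libraries on your paths
export PATH=/share/apps/openmpi-1.2/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/openmpi-1.2/lib:$LD_LIBRARY_PATH
</code>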

The two scripts ''hello.run'' and ''cpi.run'' invoke the ''mpirun'' binary, like so:

<code>
#!/bin/bash

echo Running on ilogin1 and ilogin2 with -np=16

/share/apps/openmpi-1.2/bin/mpirun -np 16 \
  -machinefile /share/apps/openmpi-1.2/bin/hello.machines \
  /share/apps/openmpi-1.2/bin/hello
</code>

The two binaries have the OpenMPI libraries linked in, as ''ldd'' shows:

<code>
[hmeij@swallowtail ~]# ldd /share/apps/openmpi-1.2/bin/hello
        libmpi.so.0 => /share/apps/openmpi-1.2/lib/libmpi.so.0 (0x0000002a95557000)
        libopen-rte.so.0 => /share/apps/openmpi-1.2/lib/libopen-rte.so.0 (0x0000002a956eb000)
        libopen-pal.so.0 => /share/apps/openmpi-1.2/lib/libopen-pal.so.0 (0x0000002a95844000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003684000000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003686d00000)
        libutil.so.1 => /lib64/libutil.so.1 (0x0000003688600000)
        libm.so.6 => /lib64/tls/libm.so.6 (0x0000003683e00000)
        libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000003684400000)
        libc.so.6 => /lib64/tls/libc.so.6 (0x0000003683b00000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003683900000)
</code>

===== Compiling =====

To compile, for example, C code with OpenMPI:

<code>
/share/apps/openmpi-1.2/bin/mpicc -o ./mpi /share/apps/openmpi-1.2/bin/cpi.c
</code>

Check that the created binary finds all the libraries with ''ldd'' (see the output above).
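
For example, with a source file of your own (''mympi.c'' and the output name below are just placeholders), a quick compile-and-check cycle could look like this:

<code>
# compile your MPI source with the OpenMPI wrapper compiler
/share/apps/openmpi-1.2/bin/mpicc -o ./mympi ./mympi.c

# the OpenMPI libraries should resolve to /share/apps/openmpi-1.2/lib
ldd ./mympi | grep openmpi-1.2
</code>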




===== The Problem =====

Once you have your binary compiled, you can execute it on the head node or any other node as described above. But the ''hello.run'' and ''cpi.run'' programs point the ''mpirun'' program to a hardcoded "machines" file.

This will not work when submitting your program to ''bsub''. Platform reports:

<hi yellow>
As I mentioned, Lava is not natively capable of running parallel jobs, so you will have to write your own integration script to parse the hosts allocated by LSF (via the LSB_HOSTS variable) and feed them to your MPI distribution.
</hi>

<hi orange>
Also, a reminder: because of the lack of LSF's parallel support daemons, these scripts can only provide a loose integration with Lava. Specifically, Lava only knows about the mpirun process on the first host; it has no knowledge of the other parallel processes on the other hosts involved in a parallel job. So if, in some circumstances, a parallel job fails, Lava cannot clean up the leftover processes, for example, MPICH 1's shared-memory leftovers. You may want to check your cluster regularly for this issue.
</hi>
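
A minimal sketch of such an integration wrapper, assuming bash and the ''LSB_HOSTS'' and ''LSB_JOBID'' variables that Lava/LSF sets for each job, could look like the following. It simply turns the allocated host list into a temporary machinefile for ''mpirun''.

<code>
#!/bin/bash
# sketch only: build a machinefile from the hosts Lava allocated to this job
# LSB_HOSTS holds a space-separated list of host names, one entry per job slot
MACHINEFILE=/tmp/machines.$LSB_JOBID

for host in $LSB_HOSTS; do
  echo $host
done > $MACHINEFILE

# hand the generated machinefile to mpirun instead of the hardcoded one
/share/apps/openmpi-1.2/bin/mpirun -np $(wc -l < $MACHINEFILE) \
  -machinefile $MACHINEFILE \
  /share/apps/openmpi-1.2/bin/hello

rm -f $MACHINEFILE
</code>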

Having to write and maintain such wrapper scripts makes the job submission process for parallel jobs tedious.

So click on **Back** and we'll detail that.

\\
**[[cluster:32|Back]]**