Back

OpenMPI ENV

Tests

To test your environment, run the following two scripts and compare your output with the output shown here. Everything should already be set up for you. If it is not, contact the HPCadmin.

#1

[hmeij@swallowtail ~]$ /share/apps/bin/hello.run
Running on ilogin1 and ilogin2 with -np=16
Hello, world, I am 0 of 16
Hello, world, I am 11 of 16
Hello, world, I am 1 of 16
Hello, world, I am 2 of 16
Hello, world, I am 3 of 16
Hello, world, I am 4 of 16
Hello, world, I am 5 of 16
Hello, world, I am 6 of 16
Hello, world, I am 7 of 16
Hello, world, I am 8 of 16
Hello, world, I am 9 of 16
Hello, world, I am 10 of 16
Hello, world, I am 12 of 16
Hello, world, I am 13 of 16
Hello, world, I am 14 of 16
Hello, world, I am 15 of 16
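The order of the "Hello" lines can differ from run to run (note rank 11 appearing early above), so it is easier to compare the set of ranks than the exact text. The following is a minimal sketch of such a check. The function name check_ranks is made up for this example and is not something installed on the cluster.

```shell
# Check that the hello output contains each rank 0..N-1 exactly once.
# Reads the program output on stdin; N is the first argument.
check_ranks() {
  local n="$1"
  local got want
  # Pull out the rank number from every "Hello" line and sort numerically.
  got=$(sed -n 's/^Hello, world, I am \([0-9][0-9]*\) of [0-9][0-9]*$/\1/p' \
        | sort -n | tr '\n' ' ')
  # The ranks we expect to see: 0, 1, ..., N-1.
  want=$(seq 0 $((n - 1)) | tr '\n' ' ')
  [ "$got" = "$want" ]
}
```

It could be used as `/share/apps/bin/hello.run | check_ranks 16 && echo "all 16 ranks reported"`. A missing or duplicated rank makes the function return a non-zero status.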

#2

[hmeij@swallowtail ~]$ /share/apps/bin/cpi.run
Running on ilogin1 and ilogin2 with -np=16
Process 10 on compute-1-16.local
Process 0 on compute-1-15.local
Process 2 on compute-1-15.local
Process 3 on compute-1-15.local
Process 4 on compute-1-15.local
Process 5 on compute-1-15.local
Process 6 on compute-1-15.local
Process 7 on compute-1-15.local
Process 1 on compute-1-15.local
pi is approximately 3.1416009869231245, Error is 0.0000083333333314
wall clock time = 0.166646
Process 8 on compute-1-16.local
Process 9 on compute-1-16.local
Process 11 on compute-1-16.local
Process 12 on compute-1-16.local
Process 13 on compute-1-16.local
Process 14 on compute-1-16.local
Process 15 on compute-1-16.local

Done. For those who are interested, the what and where of OpenMPI on our cluster follows below.

OpenMPI

install directory: /share/apps/openmpi-1.2

You can add the bin/ subdirectory to your PATH if you want. This is not really necessary as long as you give the full path to the binaries in your scripts.
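If you do prefer the short command names, the PATH change could look like the sketch below, for example in ~/.bashrc. The variable name OMPI_BIN is only an illustration. The guard is there so that sourcing the file twice does not add the directory twice.

```shell
# Prepend the OpenMPI bin/ directory to PATH, but only if it is not there yet.
OMPI_BIN=/share/apps/openmpi-1.2/bin   # install directory named on this page

case ":$PATH:" in
  *":$OMPI_BIN:"*) ;;                  # already present, nothing to do
  *) PATH="$OMPI_BIN:$PATH" ;;
esac
export PATH
```

After this, `which mpirun` should print the path under /share/apps/openmpi-1.2/bin.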

The two scripts hello.run and cpi.run invoke the mpirun binary, like so:

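#!/bin/bash

# This message is hardcoded; the actual hosts come from the machinefile below.
echo "Running on ilogin1 and ilogin2 with -np=16"

# Launch 16 processes on the hosts listed in the hardcoded machinefile.
/share/apps/openmpi-1.2/bin/mpirun -np 16 \
  -machinefile /share/apps/openmpi-1.2/bin/hello.machines \
  /share/apps/openmpi-1.2/bin/hello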

The two binaries are dynamically linked against the OpenMPI libraries, like so:

[hmeij@swallowtail ~]# ldd /share/apps/openmpi-1.2/bin/hello
        libmpi.so.0 => /share/apps/openmpi-1.2/lib/libmpi.so.0 (0x0000002a95557000)
        libopen-rte.so.0 => /share/apps/openmpi-1.2/lib/libopen-rte.so.0 (0x0000002a956eb000)
        libopen-pal.so.0 => /share/apps/openmpi-1.2/lib/libopen-pal.so.0 (0x0000002a95844000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003684000000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003686d00000)
        libutil.so.1 => /lib64/libutil.so.1 (0x0000003688600000)
        libm.so.6 => /lib64/tls/libm.so.6 (0x0000003683e00000)
        libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000003684400000)
        libc.so.6 => /lib64/tls/libc.so.6 (0x0000003683b00000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003683900000)

Compiling

When you compile, for example, C code against OpenMPI:

/share/apps/openmpi-1.2/bin/mpicc -o ./mpi /share/apps/openmpi-1.2/bin/cpi.c

check with ldd that the created binary finds all of its libraries (see the output above).
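When a library cannot be resolved, ldd prints "not found" for that entry instead of a path. The sketch below turns that into a simple pass/fail check. The function name missing_libs is made up for this example, and ./mpi is the binary produced by the mpicc command above.

```shell
# Report shared libraries that the dynamic loader cannot resolve.
# Reads ldd output on stdin, prints the offending lines,
# and returns 1 if there are any, 0 otherwise.
missing_libs() {
  if grep 'not found'; then
    return 1
  fi
  return 0
}

# Only run the check if the binary has actually been built.
if [ -x ./mpi ]; then
  ldd ./mpi | missing_libs && echo "all libraries resolved"
fi
```

If any line is printed, the binary was linked against a library that is not visible on this node, and it will fail at startup.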

The Problem

Once your binary is compiled, you can run it on the head node or any other node as described above. However, the hello.run and cpi.run scripts point mpirun at a hardcoded “machines” file.

This will not work when you submit your program with bsub, because the scheduler, not a fixed file, decides which hosts the job runs on. Platform reports:

“As I mentioned, Lava is not natively capable for parallel jobs, so you will have to write your own integration script to parse the hosts allocated by LSF (with LSB_HOSTS variable) and integrate them to your MPI distribution.”

“Also, remind you that, because the lack of LSF's parallel support daemons, these scripts can only provide a loose integration to Lava. Specifically, Lava only knows the mpirun process on the first host; not knowledge to other parallel processes in other hosts involved in a parallel job. So if, in some circumstances, a parallel job fails, Lava cannot clean up the leftover processes, for example, mpich 1's shared-memory leftovers. You may want to regularly checks on your cluster on this issue.”

And this makes the job submission process for parallel jobs tedious.
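To give an idea of what such an integration script involves, here is a rough sketch, not the wrapper actually used on this cluster. It assumes that LSB_HOSTS holds a space-separated list of hostnames with one entry per allocated slot, and that mpirun treats a host listed several times in the machinefile as having that many slots. The function name hosts_to_machinefile and the temporary file name are made up for this example.

```shell
#!/bin/bash
# Sketch of a bsub job script: build a machinefile from LSB_HOSTS, then run mpirun.

# Write one hostname per line for every entry in a space-separated host list.
hosts_to_machinefile() {
  local hosts="$1" out="$2"
  local h
  : > "$out"                  # create or truncate the machinefile
  for h in $hosts; do         # word splitting on whitespace is intentional
    echo "$h" >> "$out"
  done
}

# Only do something when the scheduler has actually set LSB_HOSTS.
if [ -n "${LSB_HOSTS:-}" ]; then
  MACHINEFILE=$(mktemp "${TMPDIR:-/tmp}/machines.XXXXXX")
  trap 'rm -f "$MACHINEFILE"' EXIT          # remove the file when the job ends
  hosts_to_machinefile "$LSB_HOSTS" "$MACHINEFILE"
  NP=$(( $(wc -l < "$MACHINEFILE") ))       # one process per allocated slot

  /share/apps/openmpi-1.2/bin/mpirun -np "$NP" \
    -machinefile "$MACHINEFILE" \
    /share/apps/openmpi-1.2/bin/hello
fi
```

The process count and host list then follow whatever the scheduler allocated instead of a fixed file. This sketch does nothing about the leftover-process cleanup mentioned in the second quote.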

So click on Back and we'll detail that.


Back

cluster/31.txt · Last modified: 2007/04/19 19:45 (external edit)