To test your environment, execute the following two scripts and compare your output with the output shown here. It should all be set up for you already. If not, contact the HPCadmin.
[hmeij@swallowtail ~]$ /share/apps/bin/hello.run
Running on ilogin1 and ilogin2 with -np=16
Hello, world, I am 0 of 16
Hello, world, I am 11 of 16
Hello, world, I am 1 of 16
Hello, world, I am 2 of 16
Hello, world, I am 3 of 16
Hello, world, I am 4 of 16
Hello, world, I am 5 of 16
Hello, world, I am 6 of 16
Hello, world, I am 7 of 16
Hello, world, I am 8 of 16
Hello, world, I am 9 of 16
Hello, world, I am 10 of 16
Hello, world, I am 12 of 16
Hello, world, I am 13 of 16
Hello, world, I am 14 of 16
Hello, world, I am 15 of 16
[hmeij@swallowtail ~]$ /share/apps/bin/cpi.run
Running on ilogin1 and ilogin2 with -np=16
Process 10 on compute-1-16.local
Process 0 on compute-1-15.local
Process 2 on compute-1-15.local
Process 3 on compute-1-15.local
Process 4 on compute-1-15.local
Process 5 on compute-1-15.local
Process 6 on compute-1-15.local
Process 7 on compute-1-15.local
Process 1 on compute-1-15.local
pi is approximately 3.1416009869231245, Error is 0.0000083333333314
wall clock time = 0.166646
Process 8 on compute-1-16.local
Process 9 on compute-1-16.local
Process 11 on compute-1-16.local
Process 12 on compute-1-16.local
Process 13 on compute-1-16.local
Process 14 on compute-1-16.local
Process 15 on compute-1-16.local
Done. For those who are interested, below is the what and where of OpenMPI on our cluster.
… you can add the bin/ subdirectory to your path if you want. This is not really necessary as long as you provide the full path to the binaries in your scripts.
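If you do want the directory on your path, a minimal sketch for your ~/.bashrc could look like the following. The helper name path_prepend is made up for illustration, and the directory is assumed to be the /share/apps/openmpi-1.2/bin location used by the scripts on this page.

```shell
# Prepend a directory to PATH unless it is already listed.
# path_prepend is a hypothetical helper, not something installed on the cluster.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present, leave PATH alone
        *) PATH="$1:$PATH" ;;       # otherwise put it in front
    esac
}

# Assumed install location, taken from the scripts shown on this page.
path_prepend /share/apps/openmpi-1.2/bin
export PATH
```

The check against duplicates keeps the path from growing each time the file is sourced again.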
The two scripts hello.run and cpi.run invoke the mpirun binary, like so
#!/bin/bash
echo Running on ilogin1 and ilogin2 with -np=16
/share/apps/openmpi-1.2/bin/mpirun -np 16 \
    -machinefile /share/apps/openmpi-1.2/bin/hello.machines \
    /share/apps/openmpi-1.2/bin/hello
The two binaries have libraries linked in, like so
[hmeij@swallowtail ~]# ldd /share/apps/openmpi-1.2/bin/hello
        libmpi.so.0 => /share/apps/openmpi-1.2/lib/libmpi.so.0 (0x0000002a95557000)
        libopen-rte.so.0 => /share/apps/openmpi-1.2/lib/libopen-rte.so.0 (0x0000002a956eb000)
        libopen-pal.so.0 => /share/apps/openmpi-1.2/lib/libopen-pal.so.0 (0x0000002a95844000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003684000000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003686d00000)
        libutil.so.1 => /lib64/libutil.so.1 (0x0000003688600000)
        libm.so.6 => /lib64/tls/libm.so.6 (0x0000003683e00000)
        libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000003684400000)
        libc.so.6 => /lib64/tls/libc.so.6 (0x0000003683b00000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003683900000)
When you compile, for example, C code for OpenMPI
/share/apps/openmpi-1.2/bin/mpicc -o ./mpi /share/apps/openmpi-1.2/bin/cpi.c
check that the created binary finds all the libraries with ldd (see output above).
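One way to automate that check is to filter the ldd output for unresolved entries. This is only a sketch: missing_libs is a made-up helper name, and it relies on ldd marking an unresolved library with "=> not found".

```shell
# Print the names of libraries that ldd reports as "not found".
# Reads ldd output on stdin, so it can be tried without a real binary.
# missing_libs is a hypothetical helper name.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage, with ./mpi being the binary built by the mpicc command above:
#   ldd ./mpi | missing_libs
```

If this prints nothing, every library was resolved. If it lists one of the OpenMPI libraries, adding /share/apps/openmpi-1.2/lib to LD_LIBRARY_PATH is a likely fix, assuming that is where the libraries live, as the ldd output above suggests.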
Once you have your binary compiled, you can execute it on the head node or any other node as described above. But the hello.run and cpi.run scripts point the mpirun program to a hardcoded “machines” file.
This will not work when submitting your program to bsub. Platform reports:
<hi yellow> As I mentioned, Lava is not natively capable for parallel jobs, so you will have to write your own integration script to parse the hosts allocated by LSF (with LSB_HOSTS variable) and integrate them to your MPI distribution. </hi>
<hi orange> Also, remind you that, because the lack of LSF's parallel support daemons, these scripts can only provide a loose integration to Lava. Specifically, Lava only knows the mpirun process on the first host; not knowledge to other parallel processes in other hosts involved in a parallel job. So if, in some circumstances, a parallel job fails, Lava cannot clean up the leftover processes, for example, mpich 1's shared-memory leftovers. You may want to regularly checks on your cluster on this issue. </hi>
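The integration script Platform describes could look roughly like the sketch below. It is not the wrapper used on this cluster. It assumes LSB_HOSTS holds one hostname per allocated slot, separated by spaces, and it writes the "host slots=N" hostfile format that Open MPI accepts. The function names hosts_to_machinefile and count_slots are made up for illustration.

```shell
#!/bin/bash
# Sketch: build an Open MPI machinefile from the hosts the scheduler allocated.
# Assumption: LSB_HOSTS looks like "compute-1-15 compute-1-15 compute-1-16",
# with a host repeated once for every slot it contributes.

# Print one "host slots=N" line per distinct host, in first-seen order.
hosts_to_machinefile() {
    local host
    for host in $1; do
        echo "$host"
    done | awk '
        !($1 in count) { order[++n] = $1 }
        { count[$1]++ }
        END { for (i = 1; i <= n; i++) printf "%s slots=%d\n", order[i], count[order[i]] }
    '
}

# Count the slots (words) in the host list.
count_slots() {
    set -- $1
    echo $#
}

# Launch only when running under the scheduler with a program to start.
if [ -n "${LSB_HOSTS:-}" ] && [ $# -gt 0 ]; then
    machinefile=$(mktemp) || exit 1
    trap 'rm -f "$machinefile"' EXIT      # remove the temporary file on exit
    hosts_to_machinefile "$LSB_HOSTS" > "$machinefile"
    /share/apps/openmpi-1.2/bin/mpirun -np "$(count_slots "$LSB_HOSTS")" \
        -machinefile "$machinefile" "$@"
fi
```

Saved under a name of your choosing and passed to bsub together with the program to run, a wrapper like this replaces the hardcoded machines file. It does nothing about the cleanup problem Platform mentions: Lava still only sees the mpirun process on the first host.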
And this makes the job submission process for parallel jobs tedious.
So click on Back and we'll detail that.