⇒ This is page 2 of 3, navigation provided at bottom of page

Switch & MPI Flavors

As you can see in the GalaxSee example, parallel code can provide significant speedup in job processing times, at least until a saturation point is reached and performance takes a hit because of the excessive time spent passing messages.

What we'd like to know next is:

  1. when should we consider which hardware switch to use?
  2. which flavor of MPI should we use?
  3. does any of it matter?

The GalaxSee example invoked OpenMPI (compiled with the TopSpin libraries) running across nodes on the gigabit ethernet (GigE) switch. Executables built against OpenMPI can run over either the IB or the GigE switch. The default is to use InfiniBand (IB) if the interfaces are active; in the absence of an IB interconnect, OpenMPI falls back to GigE. The following message implies such a situation:

--------------------------------------------------------------------------
[0,1,0]: MVAPI on host nfs-2-4 was unable to find any HCAs.
Another transport will be used instead, although this may result in 
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,2]: MVAPI on host nfs-2-4 was unable to find any HCAs.
Another transport will be used instead, although this may result in 
lower performance.
--------------------------------------------------------------------------
...
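Rather than relying on the automatic fallback, the transport can also be selected explicitly on the mpirun command line. A minimal sketch reusing the same --mca flags that appear in the submit script further down (the core count and executable name are placeholders):

# force TCP over the GigE interface eth0, excluding the mvapi (IB) transport
mpirun --mca btl ^mvapi --mca btl_tcp_if_include eth0 -np 8 ./my_mpi_prog

# default behavior: mvapi (IB) is preferred whenever the HCA ports are active
mpirun -np 8 ./my_mpi_prog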

Let's run a chemistry example using Amber 9 with sander.MPI. We'll run two of the benchmark programs that ship with Amber, jac and factor_ix. We're not sure exactly what they simulate, but that does not matter here. We'll also write a different script to drive the Amber examples; code is inserted to detect whether a particular node is InfiniBand-enabled.

Our cluster contains the TopSpin InfiniBand libraries (from Cisco) specific to our Cisco switch; they are located in /usr/local/topspin. In order to test different MPI flavors, OpenMPI has been compiled again, this time with Intel's icc compiler, and at configure time it was pointed at those same InfiniBand libraries. So this version of OpenMPI has both GigE and IB support … it is in /share/apps/openmpi-1.2_intel

# intel compilers + topspin
./configure --prefix /share/apps/openmpi-1.2_intel \
        CC=icc CXX=icpc F77=ifort FC=ifort \
        --disable-shared --enable-static \
        --with-mvapi=/usr/local/topspin \
        --with-mvapi-libdir=/usr/local/topspin/lib64
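
A quick way to confirm which transports a given OpenMPI installation actually supports is to query ompi_info; an IB-aware build should list the mvapi component among its btl entries:

# list the byte transfer layer (btl) components compiled into this build
/share/apps/openmpi-1.2_intel/bin/ompi_info | grep btl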

OpenMPI was also recompiled against gcc/g95 without any references to the IB libraries. It is located in /share/apps/openmpi-1.2.
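
For reference, the configure line for that plain build would have looked roughly like the one below; the exact options were not recorded, so treat this as a sketch (the prefix and compilers come from the text above, the remaining flags simply mirror the Intel build):

# gcc/g95, no InfiniBand support (sketch, actual flags not recorded)
./configure --prefix /share/apps/openmpi-1.2 \
        CC=gcc CXX=g++ F77=g95 FC=g95 \
        --disable-shared --enable-static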

We have 3 versions of Amber 8-)

#1 Amber was compiled (with icc/ifort) against the TopSpin installation by specifying the following in config.h, and was installed in /share/apps/amber/9/exe/ … so this version will always use the InfiniBand libraries. Jobs should be submitted to the 16-ilwnodes queue. Contains pmemd.

LOAD= ifort   $(LOCALFLAGS) $(AMBERBUILDFLAGS)
...
LOADLIB= -L/usr/local/topspin/mpi/mpich/lib64 
             -lmpichf90_i -lmpich_i -lmpichfsup_i 
             -lmpich -lvapi -lmosal -lpthread -ldl  
         -L/share/apps/intel/cmkl/9.1/lib/em64t 
             -lvml -lmkl_lapack -lmkl -lguide -lpthread

#2 Amber was compiled (with icc/ifort) against the OpenMPI installation (described above) by specifying the following in config.h and was installed in /share/apps/amber/9openmpi/exe/. This is the IB/GigE “aware” OpenMPI. Jobs can be submitted to any queue since all nodes are GigE-enabled. Does not contain pmemd.

LOAD= mpif90   $(LOCALFLAGS) $(AMBERBUILDFLAGS)
(resolves to /share/apps/openmpi-1.2_intel/bin/mpif90)
...
LOADLIB= -L/usr/local/topspin/lib64 
         -L/share/apps/openmpi-1.2_intel/lib 
               -lmpi_f90 -lmpi_f77 -lmpi -lopen-rte 
               -lopen-pal -lvapi -lmosal -lrt -lnuma 
               -ldl -Wl,--export-dynamic -lnsl -lutil 
         -L/share/apps/intel/cmkl/9.1.021/lib/em64t 
               -lvml -lmkl_lapack -lmkl -lguide -lpthread

#3 Amber was again compiled (with gcc/g95) against the “plain” OpenMPI installation, without any references to the InfiniBand libraries, and was installed in /share/apps/amber/9plain/exe/. Jobs can be submitted to any queue, but if submitted to the 16-ilwnodes queue, the GigE interface will be used, not the IB interface. Does not contain pmemd.

Complicated enough?

Test Runs

Let's do some test runs. There will be some noise here since we're running against nodes that are doing work, but we'll avoid heavily loaded nodes. For runs requesting 8 cores or fewer we run against an idle host to get good baseline results. In the script we'll enable the appropriate code block, setting the references to sander.MPI, the MPI flavor, and the queue.

#!/bin/bash

#BSUB -q 16-ilwnodes
#BSUB -J test
#BSUB -o out
#BSUB -e err

# change next 2 lines
#BSUB -n 8
NP=8

# gcc/g95 compiled sander + GigE only openmpi 
#SANDER="/share/apps/amber/9plain/exe/sander.MPI"
#MPIRUN="/share/apps/bin/mpich-mpirun.openmpi"
#PATH=/share/apps/openmpi-1.2/bin:$PATH

# intel compiled sander + IB/GigE openmpi
#export LD_LIBRARY_PATH=/share/apps/intel/cmkl/9.1.021/lib/em64t:$LD_LIBRARY_PATH
#SANDER="/share/apps/amber/9openmpi/exe/sander.MPI"
#MPIRUN="/share/apps/bin/mpich-mpirun.openmpi_intel"
#PATH=/share/apps/openmpi-1.2_intel/bin:$PATH

# intel compiled sander + infiniband topspin
SANDER="/share/apps/amber/9/exe/sander.MPI"
MPIRUN="/share/apps/bin/mpich-mpirun.topspin"
PATH=/usr/local/topspin/mpi/mpich/bin:$PATH

# scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID

rm -rf err out logfile mdout restrt mdinfo
cd $MYSANSCRATCH
export PATH

# which interconnects, pick IB if present
TEST='INACTIVE'
if [ -f /proc/topspin/core/ca1/port1/info ]; then
        TEST="`grep ACTIVE /proc/topspin/core/ca1/port1/info | \
               awk -F: '{print $2}' | sed 's/\s//g'`"
fi
if [ "$TEST" == 'ACTIVE' ]; then
        echo "Running on IB-enabled node set"
        DO=$MPIRUN
else
        echo "Running on GigE-enabled node set"
        # use eth0 only; eth1 is on the nfs switch
        DO="$MPIRUN --mca btl ^mvapi --mca btl_tcp_if_include eth0"
fi

# jac bench
cp /home/hmeij/1g6r/test/mdin.jac ./mdin
cp /share/apps/amber/9/benchmarks/jac/inpcrd.equil .
cp /share/apps/amber/9/benchmarks/jac/prmtop .
time $DO -np $NP $SANDER -O -i mdin -c inpcrd.equil -o mdout < /dev/null
cp ./mdout /home/hmeij/1g6r/test/mdout.jac

# factor_ix bench
cp /home/hmeij/1g6r/test/mdin.ix ./mdin
cp /share/apps/amber/9/benchmarks/factor_ix/inpcrd .
cp /share/apps/amber/9/benchmarks/factor_ix/prmtop .
time $DO -np $NP $SANDER -O -i mdin -o mdout < /dev/null
cp ./mdout /home/hmeij/1g6r/test/mdout.ix
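
Assuming the script above is saved as run_amber.sh (a name chosen here just for illustration), it is handed to LSF with bsub, which reads the embedded #BSUB directives:

bsub < run_amber.sh    # submit; LSF picks up queue, -n and output files from the #BSUB lines
bjobs                  # check on the job afterwards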

Results

Puzzling … or maybe not.

single host compute-1-2 (IB/GigE enabled), all runs in queue: test

-np   OpenMPI (GigE)   OpenMPI Intel (GigE / IB)   TopSpin (IB)
 02   10m33s           7m41s / 8m27s               7m49s
 04    5m22s           4m42s / 4m35s               5m52s
 08    3m40s           2m58s / 2m56s               3m13s

Perhaps our problem is not complex enough to show differences amongst the different MPI and switch options. On the other hand, with 8 or fewer cores all the message passing happens within a single node (across its 2 processors), so it never touches the IB or GigE interfaces at all. Maybe that explains the roughly equal speedup for every combination. Duh.

Next, let's ask for enough cores that we invoke multiple nodes (the host counts below) and thus the appropriate interconnect. That should yield some differences.
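
In the submit script only the two lines flagged “change next 2 lines” need to be kept in step; for a 16-core request, for example:

# change next 2 lines
#BSUB -n 16
NP=16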

switching to larger core requests (hosts used in parentheses)

-np   OpenMPI (GigE)        OpenMPI Intel (GigE / IB)      TopSpin (IB)
      queue: 16-lwnodes     queue: 16-ilwnodes             queue: 16-ilwnodes
 16   18m16s (4)            18m37s (3) / 02m01s (3)        02m08s (3)
 32   17m19s (7)            18m24s (6) / 01m34s (7)        01m29s (7)

We now observe a dramatic increase in performance for the InfiniBand switch: at 16 cores the runtime drops from roughly 18 minutes over GigE to about 2 minutes over IB, a speedup of close to 9x. From some online reading we had only been expecting a factor of 3-4.

So you could run your Amber parallel job across the GigE-enabled nodes; it will just run relatively slowly, depending on your volume of message passing. If you need more than 32 cores and the 16-ilwnodes queue is busy, you could go to the idle queue (all nodes) and run your job across both GigE and IB nodes with OpenMPI. Adjust the script accordingly so it does not test for the IB interfaces, as sketched below. Just for fun, a 64-core request is shown in the last table.
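
One way to do that (a sketch of the idea, not the exact version used for the run below) is to drop the /proc/topspin test and always take the GigE branch, reusing the same --mca flags:

# idle queue: skip the IB probe and always run TCP over eth0
DO="$MPIRUN --mca btl ^mvapi --mca btl_tcp_if_include eth0"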

switching to idle queue (hosts used in parentheses)

-np   OpenMPI (GigE)   OpenMPI Intel (GigE only)   TopSpin (IB)
 64   DNK              22m20s (11)                 DNK

In the end it appears that the choice of MPI flavor does not have any real impact. Running over the InfiniBand libraries, however, makes a huge difference, as expected.

⇒ go to page 3 of 3 of what is parallel computing?

