\\
**[[cluster:0|Back]]**


⇒ This is page 2 of 3, navigation provided at bottom of page


===== Switch & MPI Flavors =====

As you can see in **[[cluster:41|the GalaxSee example]]**, parallel code can provide significant speedup in job processing times, at least until a saturation point is reached and performance suffers because of the excessive time spent passing messages.

What we'd like to know next is:

  - when should we consider the hardware switch in question?
  - which flavor of MPI should we use?
  - does any of it matter?

The GalaxSee example invoked OpenMPI (compiled with the TopSpin libraries) running across nodes on the gigabit ethernet switch (GigE). Executables built against OpenMPI can run over either the IB or GigE switch. The default is to use InfiniBand (IB) if the interfaces are active; in the absence of an IB interconnect, OpenMPI falls back to GigE. The following message indicates such a situation:

<code>
--------------------------------------------------------------------------
[0,1,0]: MVAPI on host nfs-2-4 was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,2]: MVAPI on host nfs-2-4 was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
...
</code>
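
If you want to force a particular transport rather than rely on this default selection, OpenMPI's runtime MCA parameters can do it. A minimal sketch (the executable name ''a.out'' and the core count are just placeholders):

<code>
# force TCP (GigE) only, even on IB-enabled nodes
mpirun --mca btl tcp,self -np 4 ./a.out

# equivalently, exclude the mvapi (InfiniBand) component
mpirun --mca btl ^mvapi -np 4 ./a.out
</code>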

Let's run a chemistry example using Amber 9 with ''sander.MPI''. We'll run two of the benchmark programs that come with Amber, ''jac'' and ''factor_ix''. Not sure what they do, but that does not matter. We'll also design a different script to drive the Amber examples. Code is inserted to detect whether a particular node is InfiniBand-enabled.

Our cluster contains TopSpin InfiniBand libraries (from Cisco) specific to our Cisco switch. They are located in ''**/usr/local/topspin**''. In order to test different MPI flavors, OpenMPI has been compiled again, this time with Intel's icc compiler, and at configure time was pointed at those same InfiniBand libraries. So this version of OpenMPI supports both GigE and IB ... it is in ''**/share/apps/openmpi-1.2_intel**''.

<code>
# intel compilers + topspin
./configure --prefix /share/apps/openmpi-1.2_intel \
        CC=icc CXX=icpc F77=ifort FC=ifort \
        --disable-shared --enable-static \
        --with-mvapi=/usr/local/topspin \
        --with-mvapi-libdir=/usr/local/topspin/lib64
</code>
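
One way to confirm that this build really has both transports available is to ask it; ''ompi_info'' lists the compiled-in BTL components (mvapi for IB, tcp for GigE):

<code>
# list the byte-transfer-layer (BTL) components of this build;
# expect to see mvapi and tcp among them
/share/apps/openmpi-1.2_intel/bin/ompi_info | grep btl
</code>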

OpenMPI was also recompiled against gcc/g95 without any references to the IB libraries. It is located in ''**/share/apps/openmpi-1.2**''.
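
The configure line for that plain build was presumably along these lines (a sketch; the gcc/g95 flags are assumed, only the install prefix is documented above):

<code>
# gcc/g95 compilers, no InfiniBand support
./configure --prefix /share/apps/openmpi-1.2 \
        CC=gcc CXX=g++ F77=g95 FC=g95 \
        --disable-shared --enable-static
</code>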

We have 3 versions of Amber 8-)

#1 Amber was compiled (with icc/ifort) against the TopSpin installation by specifying the following in config.h and was installed in **''/share/apps/amber/9/exe/''** ... so this version will always use the InfiniBand libraries. Jobs should be submitted to the **''16-ilwnodes''** queue. Contains ''pmemd''.

<code>
LOAD= ifort   $(LOCALFLAGS) $(AMBERBUILDFLAGS)
...
LOADLIB= -L/usr/local/topspin/mpi/mpich/lib64
             -lmpichf90_i -lmpich_i -lmpichfsup_i
             -lmpich -lvapi -lmosal -lpthread -ldl
         -L/share/apps/intel/cmkl/9.1/lib/em64t
             -lvml -lmkl_lapack -lmkl -lguide -lpthread
</code>

#2 Amber was compiled (with icc/ifort) against the OpenMPI installation described above by specifying the following in config.h and was installed in **''/share/apps/amber/9openmpi/exe/''**. This is the IB/GigE //"aware"// OpenMPI. Jobs can be submitted to any queue since all nodes are GigE-enabled. Does not contain ''pmemd''.

<code>
LOAD= mpif90   $(LOCALFLAGS) $(AMBERBUILDFLAGS)
(resolves to /share/apps/openmpi-1.2_intel/bin/mpif90)
...
LOADLIB= -L/usr/local/topspin/lib64
         -L/share/apps/openmpi-1.2_intel/lib
               -lmpi_f90 -lmpi_f77 -lmpi -lopen-rte
               -lopen-pal -lvapi -lmosal -lrt -lnuma
               -ldl -Wl,--export-dynamic -lnsl -lutil
         -L/share/apps/intel/cmkl/9.1.021/lib/em64t
               -lvml -lmkl_lapack -lmkl -lguide -lpthread
</code>

#3 Amber was again compiled (with gcc/g95) against the "plain" OpenMPI installation without any references to the InfiniBand libraries and was installed in **''/share/apps/amber/9plain/exe/''**. Jobs can be submitted to any queue, but if submitted to the ''16-ilwnodes'' queue, the GigE interface will be used, not the IB interface. Does not contain ''pmemd''.

Complicated enough?
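
If you lose track of which build is which, ''ldd'' offers a quick sanity check. The OpenMPI builds above are static, so the MPI libraries themselves may not show up, but any dynamically linked TopSpin and MKL libraries will:

<code>
# shared library dependencies hint at the underlying MPI/IB stack
ldd /share/apps/amber/9/exe/sander.MPI        | egrep 'mpi|vapi|mkl'
ldd /share/apps/amber/9openmpi/exe/sander.MPI | egrep 'mpi|vapi|mkl'
ldd /share/apps/amber/9plain/exe/sander.MPI   | egrep 'mpi|vapi|mkl'
</code>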


===== Test Runs =====

Let's do some test runs. There will be some noise here, as we're running against nodes that are doing work, but we'll avoid heavily loaded nodes. For runs with fewer than 8 cores per request we run against an idle host to get good baseline results. In the script we'll change the appropriate code block, setting the references to **''sander.MPI''** and the MPI flavor/queue.

<code>

#!/bin/bash

#BSUB -q 16-ilwnodes
#BSUB -J test
#BSUB -o out
#BSUB -e err

# change next 2 lines
#BSUB -n 8
NP=8

# gcc/g95 compiled sander + GigE only openmpi
#SANDER="/share/apps/amber/9plain/exe/sander.MPI"
#MPIRUN="/share/apps/bin/mpich-mpirun.openmpi"
#PATH=/share/apps/openmpi-1.2/bin:$PATH

# intel compiled sander + IB/GigE openmpi
#export LD_LIBRARY_PATH=/share/apps/intel/cmkl/9.1.021/lib/em64t:$LD_LIBRARY_PATH
#SANDER="/share/apps/amber/9openmpi/exe/sander.MPI"
#MPIRUN="/share/apps/bin/mpich-mpirun.openmpi_intel"
#PATH=/share/apps/openmpi-1.2_intel/bin:$PATH

# intel compiled sander + infiniband topspin
SANDER="/share/apps/amber/9/exe/sander.MPI"
MPIRUN="/share/apps/bin/mpich-mpirun.topspin"
PATH=/usr/local/topspin/mpi/mpich/bin:$PATH

# scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID

rm -rf err out logfile mdout restrt mdinfo
cd $MYSANSCRATCH
export PATH

# which interconnect? pick IB if the port is present and active
TEST='INACTIVE'
if [ -f /proc/topspin/core/ca1/port1/info ]; then
        TEST="`grep ACTIVE /proc/topspin/core/ca1/port1/info | \
               awk -F: '{print $2}' | sed 's/\s//g'`"
fi
if [ "$TEST" == 'ACTIVE' ]; then
        echo "Running on IB-enabled node set"
        DO=$MPIRUN
else
        echo "Running on GigE-enabled node set"
        # use eth0 (eth1 is the nfs switch)
        DO="$MPIRUN --mca btl ^mvapi --mca btl_tcp_if_include eth0"
fi

# jac bench
cp /home/hmeij/1g6r/test/mdin.jac ./mdin
cp /share/apps/amber/9/benchmarks/jac/inpcrd.equil .
cp /share/apps/amber/9/benchmarks/jac/prmtop .
time $DO -np $NP $SANDER -O -i mdin -c inpcrd.equil -o mdout < /dev/null
cp ./mdout /home/hmeij/1g6r/test/mdout.jac

# factor_ix bench
cp /home/hmeij/1g6r/test/mdin.ix ./mdin
cp /share/apps/amber/9/benchmarks/factor_ix/inpcrd .
cp /share/apps/amber/9/benchmarks/factor_ix/prmtop .
time $DO -np $NP $SANDER -O -i mdin -o mdout < /dev/null
cp ./mdout /home/hmeij/1g6r/test/mdout.ix

</code>
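
Submission is the usual LSF routine; assuming the script above is saved as ''run_amber.sh'' (a name picked here just for illustration):

<code>
# submit; the #BSUB directives in the script select queue and core count
bsub < run_amber.sh

# monitor the job and check the output files afterwards
bjobs
cat out err
</code>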



===== Results =====

Puzzling ... or maybe not.

^  single host compute-1-2 (IB/GigE enabled)  ^^^^^^
^  q: test  ^^  q: test  ^^  q: test  ^^
^  l: OpenMPI (GigE)  ^^  l: OpenMPI Intel (GigE - IB)  ^^  l: TopSpin (IB)  ^^
^  -np  ^  time  ^  -np  ^  time  ^  -np  ^  time  ^
|  02  |  10m33s  |  02  |  7m41s - 8m27s  |  02  |  7m49s  |
|  04  |  5m22s  |  04  |  4m42s - 4m35s  |  04  |  5m52s  |
|  08  |  3m40s  |  08  |  2m58s - 2m56s  |  08  |  3m13s  |

Perhaps our problem is not complex enough to show differences amongst the different MPI and switch options. On the other hand, with 8 or fewer cores all the message passing is done on a single node across 2 processors. Maybe that implies equal speedup for any combination, since the IB and GigE interfaces are never used. Duh.

Next let's ask for enough cores to invoke multiple nodes (the host count below) and thus the appropriate interface. That should yield some differences.
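
The host counts shown in parentheses in the tables below were read off the LSF allocation; a quick way to see it from inside a job script, using LSF's standard environment variables:

<code>
# LSB_MCPU_HOSTS lists each allocated host followed by its slot count,
# e.g. "compute-1-2 8 compute-1-3 8"; LSB_HOSTS has one entry per slot
echo "allocation: $LSB_MCPU_HOSTS"
echo "unique hosts: `echo $LSB_HOSTS | tr ' ' '\n' | sort -u | wc -l`"
</code>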


^  switching to larger core requests  ^^^^^^
^  q: 16-lwnodes  ^^  q: 16-ilwnodes  ^^  q: 16-ilwnodes  ^^
^  l: OpenMPI (GigE)  ^^  l: OpenMPI Intel (GigE - IB)  ^^  l: TopSpin (IB)  ^^
^  -np  ^  time (hosts)  ^  -np  ^  time (hosts)  ^  -np  ^  time (hosts)  ^
|  16  |  18m16s (4)  |  16  |  18m37s (3) - 02m01s (3)  |  16  |  02m08s (3)  |
|  32  |  17m19s (7)  |  32  |  18m24s (6) - 01m34s (7)  |  32  |  01m29s (7)  |

We now observe a dramatic increase in performance for the InfiniBand switch. We were expecting, from some online reading, a speedup of a factor of 3-4.

So you could run your Amber parallel job across the GigE-enabled nodes. It will just run relatively slowly, depending on your volume of message passing. If you need more than 32 cores and the ''16-ilwnodes'' queue is busy, you could go to the idle queue (all nodes) and run your job across the GigE & IB nodes with OpenMPI. Adjust the script accordingly so it does not test for the IB interfaces. Just for fun, a 64-core request below.
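
Concretely, "adjust the script" just means skipping the ''/proc/topspin'' probe and hard-wiring the GigE branch; a sketch using the same variables as the script above:

<code>
# always use the IB/GigE aware OpenMPI build, but force TCP over eth0
export LD_LIBRARY_PATH=/share/apps/intel/cmkl/9.1.021/lib/em64t:$LD_LIBRARY_PATH
SANDER="/share/apps/amber/9openmpi/exe/sander.MPI"
MPIRUN="/share/apps/bin/mpich-mpirun.openmpi_intel"
PATH=/share/apps/openmpi-1.2_intel/bin:$PATH; export PATH
DO="$MPIRUN --mca btl ^mvapi --mca btl_tcp_if_include eth0"
</code>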

^  switching to idle queue  ^^^^^^
^  l: OpenMPI (GigE)  ^^  l: OpenMPI Intel (GigE only)  ^^  l: TopSpin (IB)  ^^
^  -np  ^  time (hosts)  ^  -np  ^  time (hosts)  ^  -np  ^  time (hosts)  ^
|  64  |  DNK  |  64  |  22m20s (11)  |  64  |  DNK  |

In the end it appears that picking a flavor of MPI does not have any impact. Running the InfiniBand libraries does make a huge difference, as expected.


=> go to [[cluster:43|page 3 of 3]] of __what is parallel computing?__


\\
**[[cluster:0|Back]]**