\\
**[[cluster:23|Back]]**

The production copy of OpenMPI is in ''/share/apps/openmpi-1.2''.\\
 --- //[[hmeij@wesleyan.edu|Henk Meij]] 2007/04/19 15:27//
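
To pick up that production copy in a shell session, something like the lines below should do; this is a minimal sketch that assumes the usual ''bin/'' and ''lib/'' layout under that prefix.

<code>
# minimal sketch: point the shell at the production OpenMPI install
# (prefix /share/apps/openmpi-1.2 taken from the note above)
export PATH=/share/apps/openmpi-1.2/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/openmpi-1.2/lib:$LD_LIBRARY_PATH

which mpirun    # should now report /share/apps/openmpi-1.2/bin/mpirun
</code>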

====== HPLinpack Runs ======

The purpose here is to rerun the HPLinpack benchmarks Amol ran while configuring the cluster.

^Before^
|{{:cluster:hplburn_before.gif|Idle!}}|
^During^
|{{:cluster:hplburn_during.gif|Heat!}}|
^Ooops^
|{{:cluster:hplburn.gif|Burn!}}|

FAQ [[http://www.netlib.org/benchmark/hpl/faqs.html|External Link]]

====== Problem Sizes ======

N calculation, for example: \\
4 nodes with 4 GB each is 16 GB total, which holds 2G double precision (8 byte) elements ... 2G is 2*1024*1024*1024 = 2,147,483,648 ... take the square root of that, roughly 46,340 ... 80% of that is 37,072\\

N calculation, 16 nodes (Infiniband or ethernet):\\
16 nodes with 4 GB each is 64 GB total, which holds 8G double precision (8 byte) elements ... 8G is 8*1024*1024*1024 = 8,589,934,592 ... take the square root of that, roughly 92,681 ... 80% of that is 74,145

N calculation, 4 heavy weight nodes:\\
4 nodes with 16 GB each is 64 GB total, which holds 8G double precision (8 byte) elements ... 8G is 8*1024*1024*1024 = 8,589,934,592 ... take the square root of that, roughly 92,681 ... 80% of that is 74,145

NB calculation: \\
range of 32...256\\
good starting values are 88 and 132

PxQ Grid:\\
P x Q should equal the number of processes (cores)\\
P<Q ... close to square for Infiniband, but P much smaller than Q for ethernet

LWNi (np=128): P=8, Q=16\\
LWNe (np=128): P=4, Q=32 or P=2, Q=64\\
HWN (np=32): P=4, Q=8 or P=2, Q=16
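
The arithmetic above is easy to script; here is a minimal sketch (node count, memory per node and the 80% headroom factor are the only inputs, taken from the examples above):

<code>
#!/bin/bash
# minimal sketch of the N calculation above:
#   N ~ 0.80 * sqrt( total_memory_in_bytes / 8 )
# usage: ./hpl_n.sh <nodes> <gb_per_node>
NODES=${1:-16}
GB=${2:-4}

awk -v nodes=$NODES -v gb=$GB 'BEGIN {
  bytes    = nodes * gb * 1024 * 1024 * 1024   # total memory
  elements = bytes / 8                         # 8-byte doubles
  n        = int(0.80 * sqrt(elements))        # keep ~20% headroom
  printf "total %d GB -> N of roughly %d\n", nodes * gb, n
}'
</code>

With 16 nodes at 4 GB (or 4 nodes at 16 GB) this prints roughly 74145, matching the Ns used in the HPL.dat files below.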

===== Infiniband (16 nodes) =====

  * nodes: compute-1-1 thru compute-1-16
  * each a dual quad-core 2.6 GHz PE1950 (2 x 4 cores x 16 nodes = 128 cores)
  * each with 4 GB RAM (4 GB x 16 = 64 GB total memory)

===== HPL.dat =====

<code>
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
7            device out (6=stdout,7=stderr,file)
8            # of problems sizes (N)
1000 5000 10000 15000 20000 25000 30000 35000 Ns
6            # of NBs
200 300 400 500 600 700     NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
8            Ps
16           Qs
16.0         threshold
3            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
2            # of recursive stopping criterium
2 4          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
3            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
</code>
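
HPL runs every combination of the values listed (Ns x NBs x grids x PFACTs x NBMINs x NDIVs x RFACTs x BCASTs x DEPTHs), which is why a sweep like the one above takes hours. A quick count of the tests it requests, using the counts from the file above:

<code>
#!/bin/bash
# minimal sketch: number of individual tests HPL will run for the file above
# 8 Ns, 6 NBs, 1 grid, 3 PFACTs, 2 NBMINs, 1 NDIV, 3 RFACTs, 1 BCAST, 1 DEPTH
echo $(( 8 * 6 * 1 * 3 * 2 * 1 * 3 * 1 * 1 ))   # -> 864 tests
</code>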

===== run script =====

**=> su delltest**

<code>
#!/bin/bash

echo "setting P4_GLOBMEMSIZE=10000000"

P4_GLOBMEMSIZE=10000000
export P4_GLOBMEMSIZE

echo "invoking..."
echo "/usr/local/topspin/mpi/mpich/bin/mpirun_ssh -np 128 -hostfile ./machines ./xhpl"

date > HPL.start
(/usr/local/topspin/mpi/mpich/bin/mpirun_ssh -np 128 -hostfile ./machines ./xhpl > HPL.out 2>&1)
</code>
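
''mpirun_ssh'' starts the ranks over ssh, so it is worth confirming password-less ssh to every node before kicking off an 8 hour run. A minimal sketch, with the node names taken from the list above:

<code>
#!/bin/bash
# minimal sketch: confirm password-less ssh to each Infiniband node
# before launching the benchmark (mpirun_ssh starts ranks over ssh)
for n in $(seq 1 16); do
  ssh -o BatchMode=yes compute-1-$n hostname || echo "compute-1-$n FAILED"
done
</code>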

=> With the above HPL.dat file, this configuration runs for about 8 hours ...

===== Ethernet (16 nodes) =====

  * nodes: compute-1-17 thru compute-2-32
  * each a dual quad-core 2.6 GHz PE1950 (2 x 4 cores x 16 nodes = 128 cores)
  * each with 4 GB RAM (4 GB x 16 = 64 GB total memory)

===== HPL.dat =====

<code>
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
7            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
74145        Ns
1            # of NBs
88           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
4            Ps
32           Qs
16.0         threshold
3            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
2            # of recursive stopping criterium
2 4          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
3            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
</code>


===== run script =====

**=> su delltest2**

<code>
#!/bin/bash

echo "setting P4_GLOBMEMSIZE=10000000"

P4_GLOBMEMSIZE=10000000
export P4_GLOBMEMSIZE

date > HPL.start

echo "invoking..."
 +echo "/home/delltest2/openmpi-1.2/bin/mpirun -np 128 -machinefile 
 +/home/delltest2/machines /home/delltest2/xhpl"
 +
 +(/home/delltest2/openmpi-1.2/bin/mpirun -np 128 -machinefile 
 +/home/delltest2/machines home/delltest2/xhpl > /home/delltest2/HPL.out 2>&1)&
 +</code>

=> Runs for about 4 hours ... change the lines below (two problem sizes, two block sizes) and it will run for about 14 hours.

<code>
2            # of problems sizes (N)
74145 74145  Ns
2            # of NBs
88 132       NBs
</code>
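
Once a run finishes, the per-test numbers can be pulled out of HPL.out. A minimal sketch, assuming the usual HPL output layout where each result line starts with ''W'' followed by the T/V code and ends with the Gflops figure:

<code>
#!/bin/bash
# minimal sketch: list N, NB and Gflops for every completed test in HPL.out
grep '^W' HPL.out | awk '{ printf "N=%s NB=%s  %s Gflops\n", $2, $3, $NF }'
</code>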

===== Ethernet (4 nodes) =====

  * nodes: nfs-2-1 thru nfs-2-4
  * each a dual quad-core 2.6 GHz PE1950 (2 x 4 cores x 4 nodes = 32 cores)
  * each with 16 GB RAM (16 GB x 4 = 64 GB total memory)

===== HPL.dat =====

<code>
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
7            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
74145        Ns
1            # of NBs
88           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
4            Ps
8            Qs
16.0         threshold
3            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
2            # of recursive stopping criterium
2 4          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
3            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
</code>


===== run script =====

**=> su delltest3**

<code>
#!/bin/bash

echo "setting P4_GLOBMEMSIZE=10000000"

P4_GLOBMEMSIZE=10000000
export P4_GLOBMEMSIZE

date > HPL.start

echo "invoking..."
 +echo "/home/delltest3/openmpi-1.2/bin/mpirun -np 32 -machinefile 
 + /home/delltest3/machines /home/delltest3/xhpl"
 +
 +(/home/delltest3/openmpi-1.2/bin/mpirun -np 32 -machinefile /home/delltest3/machines
 + /home/delltest3/xhpl > /home/delltest3/HPL.out 2>&1)&
 +</code>

=> Runs for about 14 1/2 hours ... change the lines below (two block sizes) and it will run for roughly 2 days.

<code>
2            # of NBs
88 132       NBs
</code>

===== MPIRUN-1.2 =====

From Sili at Platform ...

<code>
Actually, MPICH 1 always has problems in its shared memory control, and it really
takes time to debug the buggy shared memory stuff. I would rather suggest using
openmpi instead of MPICH 1 to launch the Ethernet linpack tests: openmpi is a newer
and better MPI implementation than MPICH 1, it is MPI-2 compatible, and it supports
both ethernet and infiniband devices.

The procedure I just tested is as follows.

1. Compile Openmpi
Here is the procedure I used to recompile openmpi:
# ./configure --prefix=/home/shuang/openmpi-1.2 --disable-mpi-f90
# make
# make install

To test the installation, create a host file. I generated a hostfile:
# cat /etc/hosts | grep compute | awk '{print $3}' > machines

Then I recompiled the hello example (the hello_c.c file can be found in the examples
directory of the untar'd source directory):
# /home/shuang/openmpi-1.2/bin/mpicc -o hello ./hello_c.c

And tested it:
# /home/shuang/openmpi-1.2/bin/mpirun -np 4 -machinefile machines --prefix /home/shuang/openmpi-1.2 ./hello

Please note that I used the complete path to the executables because, by default,
lam will be picked up. This is also why I used the --prefix option. You may want
to use modules to load / unload these environment settings. Please let me know if
you would like to have more information about this (open-source) software.

2. Compile Linpack with Openmpi

# wget http://www.netlib.org/benchmark/hpl/hpl.tgz
# tar zxf hpl.tgz
# cd hpl
# cp setup/Make.Linux_PII_CBLAS .

Edit Make.Linux_PII_CBLAS:
  change "MPdir"  to "/home/shuang/openmpi-1.2"
  change "MPlib"  to "$(MPdir)/lib/libmpi.so"
  change "LAdir"  to "/usr/lib64"
  change "CC"     to "/home/shuang/openmpi-1.2/bin/mpicc"
  change "LINKER" to "/home/shuang/openmpi-1.2/bin/mpicc"

Then you can make linpack by
# make arch=Linux_PII_CBLAS

To test it, edit the HPL.dat accordingly and run by:
# /home/shuang/openmpi-1.2/bin/mpirun -np 8 -machinefile machines --prefix /home/shuang/openmpi-1.2 ./xhpl
</code>
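
For reference, those five makefile edits could also be scripted. This is a minimal sketch that assumes the variable assignments in the stock ''setup/Make.Linux_PII_CBLAS'' template look like ''MPdir        = ...''; check the result before building:

<code>
#!/bin/bash
# minimal sketch: apply the five edits described above with sed
MF=Make.Linux_PII_CBLAS
OMPI=/home/shuang/openmpi-1.2

sed -i \
  -e "s|^MPdir .*|MPdir        = $OMPI|" \
  -e "s|^MPlib .*|MPlib        = \$(MPdir)/lib/libmpi.so|" \
  -e "s|^LAdir .*|LAdir        = /usr/lib64|" \
  -e "s|^CC  .*|CC           = $OMPI/bin/mpicc|" \
  -e "s|^LINKER .*|LINKER       = $OMPI/bin/mpicc|" \
  $MF

grep -E '^(MPdir|MPlib|LAdir|CC|LINKER) ' $MF   # eyeball the result
</code>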

===== MPIRUN-1.2 (fixes) =====

My experience ...

<code>

source is in /mnt/src/hmeij-tmp/foodir/src

su delltest3

export LD_LIBRARY_PATH="/home/delltest3/openmpi-1.2/lib:$LD_LIBRARY_PATH"
add this to ~/.bashrc

cd /mnt/src/hmeij-tmp/foodir/src/openmpi-1.2
./configure --prefix /home/delltest3/openmpi-1.2 --disable-mpi-f90
make
make install

cd ~
/home/delltest3/openmpi-1.2/bin/mpicc -o hello \
/mnt/src/hmeij-tmp/foodir/src/openmpi-1.2/examples/hello_c.c

the machines file setup does not like 'nfs-2-1:8',
so instead add 8 lines for each node like this 'nfs-2-1'

ldd hello
ldd openmpi-1.2/bin/mpirun

test on a single machine
/home/delltest3/openmpi-1.2/bin/mpirun -np 8 -machinefile \
/home/delltest3/machines /home/delltest3/hello

cd ~
(for some reason you need to do this for compilation to be successful)
ln -s /mnt/src/hmeij-tmp/foodir/src/hpl
cd hpl
cp ~/Make.Linux_PII_CBLAS .
make arch=Linux_PII_CBLAS
cp bin/Linux_PII_CBLAS/xhpl ~
cp bin/Linux_PII_CBLAS/HPL.dat ~

cd ~
/home/delltest3/openmpi-1.2/bin/mpirun -np 8 -machinefile \
/home/delltest3/machines /home/delltest3/xhpl > HPL.out

</code>
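
As noted above, the OpenMPI machines file wants one line per process slot rather than the ''nfs-2-1:8'' form; a minimal sketch for the four heavy weight nodes:

<code>
# minimal sketch: one line per core instead of the 'nfs-2-1:8' notation
# (nodes nfs-2-1 .. nfs-2-4, 8 cores each)
for n in 1 2 3 4; do
  for c in $(seq 1 8); do echo "nfs-2-$n"; done
done > /home/delltest3/machines
wc -l /home/delltest3/machines   # expect 32 lines, matching -np 32
</code>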
\\
**[[cluster:23|Back]]**