
The production copy of OpenMPI is in /share/apps/openmpi-1.2.
Henk Meij 2007/04/19 15:27

HPLinpack Runs

The purpose here is to rerun the HPLinpack benchmarks Amol ran while configuring the cluster.

[images] Before: Idle! · During: Heat! · Ooops: Burn!

HPL FAQ (external link)

Problem Sizes

N calculation, for example:
4 nodes with 4 GB each is 16 GB total, which holds 2G double-precision (8-byte) elements … 2G is 2*1024*1024*1024 = 2,147,483,648 … the square root of that, rounded, is 46,340 … 80% of that is 37,072

N calculation, 16 nodes (InfiniBand or Ethernet):
16 nodes with 4 GB each is 64 GB total, which holds 8G double-precision (8-byte) elements … 8G is 8*1024*1024*1024 = 8,589,934,592 … the square root of that, rounded, is 92,681 … 80% of that is 74,145

N calculation, 4 heavyweight nodes:
4 nodes with 16 GB each is 64 GB total, which holds 8G double-precision (8-byte) elements … 8G is 8*1024*1024*1024 = 8,589,934,592 … the square root of that, rounded, is 92,681 … 80% of that is 74,145
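
The same arithmetic can be scripted. A minimal sketch (not from the original page; assumes awk is available, with node count and GB per node passed as illustrative arguments):

#!/bin/bash
# sketch: recompute the HPL problem size N from total cluster memory
NODES=${1:-16}        # number of nodes (illustrative default: 16)
GB=${2:-4}            # GB of RAM per node (illustrative default: 4)
awk -v n="$NODES" -v g="$GB" 'BEGIN {
  bytes    = n * g * 1024 * 1024 * 1024   # total memory in bytes
  elements = bytes / 8                    # 8-byte double-precision elements
  nmax     = int(sqrt(elements))          # largest square matrix that fits in RAM
  printf "Nmax = %d, N at 80%% = %.0f\n", nmax, 0.80 * nmax
}'

Run with 16 nodes and 4 GB this reproduces the 74,145 above; with 4 nodes and 4 GB it gives 37,072.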

NB calculations:
range of 32…256
good starting values are 88 and 132

PxQ Grid:
P×Q should equal the number of cores (the mpirun -np value)
P ≤ Q … keep P and Q close together for InfiniBand, but make P much smaller than Q for Ethernet

LWNi, lightweight nodes over InfiniBand (np=128): P=8, Q=16
LWNe, lightweight nodes over Ethernet (np=128): P=4, Q=32 or P=2, Q=64
HWN, heavyweight nodes (np=32): P=4, Q=8 or P=2, Q=16
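
As a quick sanity check before launching, the grid can be tested against the core count; a hedged sketch (the values here are just the InfiniBand example above):

# sketch: P*Q must match the -np value handed to mpirun
P=8; Q=16; NP=128
if [ $((P * Q)) -ne $NP ]; then
  echo "P x Q = $((P * Q)) does not match np = $NP" >&2
fi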

InfiniBand (16 nodes)

  • nodes: compute-1-1 through compute-1-16
  • each a dual quad-core 2.6 GHz PE1950 (2×4×16 = 128 cores total)
  • each with 4 GB RAM (4×16 = 64 GB total memory)

HPL.dat

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
7            device out (6=stdout,7=stderr,file)
8            # of problems sizes (N)
1000 5000 10000 15000 20000 25000 30000 35000 Ns
6            # of NBs
200 300 400 500 600 700     NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
8            Ps
16           Qs
16.0         threshold
3            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
2            # of recursive stopping criterium
2 4          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
3            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

run script

⇒ su delltest

#!/bin/bash

echo "setting P4_GLOBMEMSIZE=10000000"

P4_GLOBMEMSIZE=10000000
export P4_GLOBMEMSIZE

echo "invoking..."
echo "/usr/local/topspin/mpi/mpich/bin/mpirun_ssh -np 128 -hostfile ./machines ./xhpl"

date > HPL.start
(/usr/local/topspin/mpi/mpich/bin/mpirun_ssh -np 128 -hostfile ./machines ./xhpl > HPL.out 2>&1) 

⇒ with the above HPL.dat file, this configuration runs for 8 hours …

Ethernet (16 nodes)

  • nodes: compute-1-17 through compute-2-32
  • each a dual quad-core 2.6 GHz PE1950 (2×4×16 = 128 cores total)
  • each with 4 GB RAM (4×16 = 64 GB total memory)

HPL.dat

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
7            device out (6=stdout,7=stderr,file)
1             # of problems sizes (N)
74145 Ns
1             # of NBs
88 NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
4            Ps
32           Qs
16.0         threshold
3            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
2            # of recursive stopping criterium
2 4          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
3            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

run script

⇒ su delltest2

#!/bin/bash

echo "setting P4_GLOBMEMSIZE=10000000"

P4_GLOBMEMSIZE=10000000
export P4_GLOBMEMSIZE

date > HPL.start


echo "invoking..."
echo "/home/delltest2/openmpi-1.2/bin/mpirun -np 128 -machinefile 
/home/delltest2/machines /home/delltest2/xhpl"

(/home/delltest2/openmpi-1.2/bin/mpirun -np 128 -machinefile 
/home/delltest2/machines home/delltest2/xhpl > /home/delltest2/HPL.out 2>&1)&

⇒ runs for 4 hours … change these lines below and it'll run for 14 hours

2             # of problems sizes (N)
74145 74145 Ns
2             # of NBs
88 132 NBs

Ethernet (4 nodes)

  • nodes: nfs-2-1 through nfs-2-4
  • each a dual quad-core 2.6 GHz PE1950 (2×4×4 = 32 cores total)
  • each with 16 GB RAM (16×4 = 64 GB total memory)

HPL.dat

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
7            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
74145 Ns
1            # of NBs
88 NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
4            Ps
8            Qs
16.0         threshold
3            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
2            # of recursive stopping criterium
2 4          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
3            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

run script

⇒ su delltest3

#!/bin/bash

echo "setting P4_GLOBMEMSIZE=10000000"

P4_GLOBMEMSIZE=10000000
export P4_GLOBMEMSIZE

date > HPL.start


echo "invoking..."
echo "/home/delltest3/openmpi-1.2/bin/mpirun -np 32 -machinefile 
 /home/delltest3/machines /home/delltest3/xhpl"

(/home/delltest3/openmpi-1.2/bin/mpirun -np 32 -machinefile /home/delltest3/machines
 /home/delltest3/xhpl > /home/delltest3/HPL.out 2>&1)&

⇒ runs for 14 1/2 hours … change these lines and it'll run for 2 days

2             # of NBs
88 132 NBs

MPIRUN-1.2

From Sili at Platform …

Actually, MPICH 1 has always had problems with its shared-memory control, and it really takes time to debug that buggy shared-memory code. I would rather suggest using OpenMPI instead of MPICH 1 to run the Ethernet Linpack tests: OpenMPI is a newer and better MPI implementation than MPICH 1, it is MPI-2 compatible, and it supports both Ethernet and InfiniBand devices.
 
The procedure I just tested is as follows:

1. Compile OpenMPI
Here is the procedure I used to recompile OpenMPI:
# ./configure --prefix=/home/shuang/openmpi-1.2 --disable-mpi-f90
# make
# make install
 
To test the installation, create a hostfile. I generated one with:
# cat /etc/hosts | grep compute | awk '{print $3}' > machines
 
Then I recompiled the hello example (hello_c.c can be found in the examples directory of the untarred source tree):
# /home/shuang/openmpi-1.2/bin/mpicc -o hello ./hello_c.c

And tested it:
# /home/shuang/openmpi-1.2/bin/mpirun -np 4 -machinefile machines --prefix /home/shuang/openmpi-1.2 ./hello
 
Please note that I used the complete path to the executables because, by default, LAM will be picked up; this is also why I used the --prefix option. You may want to use modules to load / unload these environment settings. Please let me know if you would like more information about this (open-source) software.
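
An alternative to spelling out full paths every time would be to put the OpenMPI 1.2 install first in the shell environment; a minimal sketch, assuming the /home/shuang/openmpi-1.2 prefix used above:

# sketch: prefer the OpenMPI 1.2 tools and libraries over the default LAM install
export PATH=/home/shuang/openmpi-1.2/bin:$PATH
export LD_LIBRARY_PATH=/home/shuang/openmpi-1.2/lib:$LD_LIBRARY_PATH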
 
2. Compile Linpack with OpenMPI
 
# wget http://www.netlib.org/benchmark/hpl/hpl.tgz
# tar zxf hpl.tgz
# cd hpl
# cp setup/Make.Linux_PII_CBLAS .
Edit Make.Linux_PII_CBLAS: change "MPdir" to "/home/shuang/openmpi-1.2", "MPlib" to "$(MPdir)/lib/libmpi.so", "LAdir" to "/usr/lib64", and both "CC" and "LINKER" to "/home/shuang/openmpi-1.2/bin/mpicc".
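
Spelled out, the edited lines of Make.Linux_PII_CBLAS would look roughly like this (a sketch of just the changed variables; exact spacing and the untouched settings follow the file as shipped):

# Make.Linux_PII_CBLAS -- only the variables changed above (sketch)
MPdir        = /home/shuang/openmpi-1.2
MPlib        = $(MPdir)/lib/libmpi.so
LAdir        = /usr/lib64
CC           = /home/shuang/openmpi-1.2/bin/mpicc
LINKER       = /home/shuang/openmpi-1.2/bin/mpicc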
 
Then you can build Linpack with:
# make arch=Linux_PII_CBLAS
 
To test it, edit HPL.dat accordingly and run:
# /home/shuang/openmpi-1.2/bin/mpirun -np 8 -machinefile machines --prefix /home/shuang/openmpi-1.2 ./xhpl

MPIRUN-1.2 (fixes)

My experience …

source is in /mnt/src/hmeij-tmp/foodir/src

su delltest3

export LD_LIBRARY_PATH="/home/delltest3/openmpi-1.2/lib:$LD_LIBRARY_PATH"
add this to ~/.bashrc

cd /mnt/src/hmeij-tmp/foodir/src/openmpi-1.2
./configure --prefix /home/delltest3/openmpi-1.2 --disable-mpi-f90
make
make install

cd ~
/home/delltest3/openmpi-1.2/bin/mpicc -o hello \
/mnt/src/hmeij-tmp/foodir/src/openmpi-1.2/examples/hello_c.c

the machines file setup does not like 'nfs-2-1:8',
so instead add 8 lines per node, each just 'nfs-2-1' (see the sketch below)
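
A minimal sketch of generating such a machines file (node names taken from the 4-node Ethernet section above; the loop itself is only illustrative):

# sketch: one line per core, 8 lines per heavyweight node
> /home/delltest3/machines
for node in nfs-2-1 nfs-2-2 nfs-2-3 nfs-2-4; do
  for i in $(seq 1 8); do echo "$node" >> /home/delltest3/machines; done
done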

ldd hello
ldd openmpi-1.2/bin/mpirun

test on a single machine
/home/delltest3/openmpi-1.2/bin/mpirun -np 8 -machinefile \
/home/delltest3/machines  /home/delltest3/hello

cd ~
(for some reason you need the symlink below for compilation to be successful)
ln -s /mnt/src/hmeij-tmp/foodir/src/hpl
cd hpl
cp ~/Make.Linux_PII_CBLAS .
make arch=Linux_PII_CBLAS  
cp bin/Linux_PII_CBLAS/xhpl ~
cp bin/Linux_PII_CBLAS/HPL.dat ~

cd ~
/home/delltest3/openmpi-1.2/bin/mpirun -np 8 -machinefile \
/home/delltest3/machines /home/delltest3/xhpl > HPL.out

