cluster:109

Lammps GPU Testing (EC)

32 cores E2660
4 K20 GPU
workstation
MPICH2 flavor

Same tests (12 cpu cores) using lj/cut, eam, lj/expand, and morse: AU.reduced

  CPU only 6 mins 1 secs
  1 GPU 1 min 1 sec (a 5-6 times speed up)
  2 GPUs 1 min 0 secs (never saw the 2nd GPU used; problem set too small?)

Same tests (12 cpu cores) using a restart file and using gayberne: GB

  CPU only 1 hour 5 mins
  1 GPU 5 mins and 15 secs (about a 12 times speed up)
  2 GPUs 2 mins

The above results seem a bit slower overall than at the other vendor, but they follow the same pattern.

Francis's Melt problem set

3d Lennard-Jones melt: for 10,000 steps with 32,000 atoms

CPU only      -np 1    -np 6    -np 12   -np 24   -np 36
loop times    329s     63s      39s      29s      45s

GPU only      1xK20    2xK20    3xK20    4xK20    (-np 1-4)
loop times    28s      16s      11s      10s

3d Lennard-Jones melt: for 100,000 steps with 32,000 atoms

GPU only      1xK20    2xK20    3xK20    4xK20    (-np 1-4)
loop times    274s     162s     120s     98s

The serial time of 329s is reduced to 29s with MPI (-np 24), an 11x speed up.
A single GPU (28s) matches MPI at -np 24, and four GPUs reduce this further to 10s, roughly a 3x speed up.
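
The speed-up figures above are plain ratios of loop times. As a rough sketch, they can be recomputed from the 10,000-step table with awk. The function names speedup and efficiency are made up for illustration and are not part of LAMMPS or the scheduler.

```shell
# speedup BASE T: print BASE/T to one decimal place.
speedup() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.1f\n", base / t }'
}

# efficiency BASE T NP: speed up divided by process count, as a whole percentage.
efficiency() {
    awk -v base="$1" -v t="$2" -v np="$3" \
        'BEGIN { printf "%.0f\n", 100 * base / (t * np) }'
}

# Loop times taken from the 10,000-step melt table.
speedup 329 29          # serial vs. MPI -np 24   -> 11.3
speedup 28 10           # 1xK20 vs. 4xK20         -> 2.8
efficiency 329 29 24    # MPI -np 24              -> 47
```

At -np 24 the parallel efficiency works out to roughly 47%. The -np 36 run (45s) is slower than -np 24, which may simply reflect oversubscribing the 32 cores of the test machine.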

Redoing Above

10/16/2013

Redoing the melt problem now on our own K20 hardware I get the following (observing with gpu-info that utilization runs about 20-25% on the GPU allocated)

Loop time of 345.936 on 1 procs for 100000 steps with 32000 atoms
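
To compare runs of different lengths, the loop time line can be turned into a throughput figure. This is a minimal sketch that assumes the log line has exactly the format shown above (other LAMMPS versions may print it differently). steps_per_sec is a hypothetical helper name.

```shell
# steps_per_sec: read LAMMPS log text on stdin and print steps per second
# for every "Loop time of ..." line. Field 4 is the loop time in seconds,
# field 9 is the number of steps.
steps_per_sec() {
    awk '/^Loop time of/ { printf "%.1f\n", $9 / $4 }'
}

# Example with the line reported above.
echo "Loop time of 345.936 on 1 procs for 100000 steps with 32000 atoms" \
    | steps_per_sec    # prints 289.1
```

On a real run the function would be fed the log file instead, for example `steps_per_sec < log.lammps`.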

#!/bin/bash
# submit via 'bsub < run.gpu'
rm -f log.lammps melt.log
#BSUB -e err
#BSUB -o out
#BSUB -q mwgpu
#BSUB -J test

## leave sufficient time between job submissions (30-60 secs)
## the number of GPUs allocated matches -n value automatically
## always reserve GPU (gpu=1), setting this to 0 is a cpu job only
## reserve 6144 MB (5 GB + 20%) memory per GPU
## run all processes (1<=n<=4) on the same node (hosts=1).

#BSUB -n 1
#BSUB -R "rusage[gpu=1:mem=6144],span[hosts=1]"

# from greentail we need to recreate module env
export PATH=/home/apps/bin:/cm/local/apps/cuda50/libs/304.54/bin:\
/cm/shared/apps/cuda50/sdk/5.0.35/bin/linux/release:/cm/shared/apps/lammps/cuda/2013-01-27/:\
/cm/shared/apps/amber/amber12/bin:/cm/shared/apps/namd/ibverbs-smp-cuda/2013-06-02/:\
/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/sbin:\
/usr/sbin:/cm/shared/apps/cuda50/toolkit/5.0.35/bin:\
/cm/shared/apps/cuda50/sdk/5.0.35/bin/linux/release:/cm/shared/apps/cuda50/libs/current/bin:\
/cm/shared/apps/cuda50/toolkit/5.0.35/open64/bin:/cm/shared/apps/mvapich2/gcc/64/1.6/bin:\
/cm/shared/apps/mvapich2/gcc/64/1.6/sbin
export PATH=/share/apps/bin:$PATH
export LD_LIBRARY_PATH=/cm/local/apps/cuda50/libs/304.54/lib64:\
/cm/shared/apps/cuda50/toolkit/5.0.35/lib64:/cm/shared/apps/amber/amber12/lib:\
/cm/shared/apps/amber/amber12/lib64:/cm/shared/apps/namd/ibverbs-smp-cuda/2013-06-02/:\
/cm/shared/apps/cuda50/toolkit/5.0.35/lib64:/cm/shared/apps/cuda50/libs/current/lib64:\
/cm/shared/apps/cuda50/toolkit/5.0.35/open64/lib:\
/cm/shared/apps/cuda50/toolkit/5.0.35/extras/CUPTI/lib:\
/cm/shared/apps/mvapich2/gcc/64/1.6/lib

# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH
cd $MYSANSCRATCH

# LAMMPS
# GPUIDX=1 use allocated GPU(s), GPUIDX=0 cpu run only (view header au.inp)
export GPUIDX=1
# stage the data
cp ~/gpu_testing/fstarr/lj/*  .
# feed the wrapper
lava.mvapich2.wrapper lmp_nVidia \
-c off -var GPUIDX $GPUIDX -in in.melt
# save results
cp log.lammps melt.log  ~/gpu_testing/fstarr/lj/

Lammps GPU Testing (MW)

Vendor: “There are currently two systems available, each with two 8-core Xeon E5-2670 processors, 32GB memory, 120GB SSD and two Tesla K20 GPUs. The hostnames are master and node2. You will see that a GPU-accelerated version of LAMMPS with MPI support is installed in /usr/local/LAMMPS.”

Actually, it turns out there are 32 cores on the node, so I suspect four CPUs.

First, we expose the GPUs to LAMMPS in our input file (running with a GPUIDX value of -1 ignores the GPUs).

# Enable GPUs if GPUIDX is set to 0 or higher.
if "(${GPUIDX} >= 0)" then &
        "suffix gpu" &
        "newton off" &
        "package gpu force 0 ${GPUIDX} 1.0"
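
In this input file GPUIDX is the upper GPU index, not a GPU count: -1 means CPU only, 0 means one GPU per node, and 1 means two GPUs per node. A small sketch of that mapping follows. gpuidx_for is a hypothetical helper, and it applies only to this input file's convention (the run.gpu script in the EC section documents a different meaning for GPUIDX).

```shell
# gpuidx_for N: N is the number of GPUs per node to use (0 = CPU only).
# Prints the matching GPUIDX value, i.e. the highest GPU index, or -1.
gpuidx_for() {
    case "$1" in
        ''|*[!0-9]*)
            echo "usage: gpuidx_for <number of GPUs per node>" >&2
            return 1
            ;;
    esac
    echo $(( $1 - 1 ))
}

gpuidx_for 0    # prints -1 (GPUs ignored)
gpuidx_for 1    # prints 0  (GPU 0 only)
gpuidx_for 2    # prints 1  (GPUs 0 and 1)
```

The result could then be passed on the command line, for example `-var GPUIDX $(gpuidx_for 2)`.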

Then we invoke the Lammps executable with MPI.

NODES=1      # number of nodes [>=1]
GPUIDX=0     # GPU indices range from [0,1], this is the upper bound.
             # set GPUIDX=0 for 1 GPU/node or GPUIDX=1 for 2 GPU/node
CORES=12     # Cores per node. (i.e. 2 CPUs with 6 cores ea =12 cores per node)

which mpirun

echo "*** GPU run with one MPI process per core ***"
date
mpirun -np $((NODES*CORES)) -bycore ./lmp_ex1 -c off -var GPUIDX $GPUIDX \
       -in film.inp -l film_1_gpu_1_node.log
date

Some tests using lj/cut, eam, lj/expand, and morse:

CPU only 4 mins 30 secs
1 GPU 0 mins 47 secs (a 5-6 times speed up)
2 GPUs 0 mins 46 secs (never saw the 2nd GPU used; problem set too small?)

Some tests using a restart file and using gayberne:

CPU only 1 hour 5 mins
1 GPU 3 mins and 33 secs (an 18-19 times speed up)
2 GPUs 2 mins (see below)

node2$ gpu-info
====================================================
Device  Model           Temperature     Utilization
====================================================
0       Tesla K20m      36 C            96 %
1       Tesla K20m      34 C            92 %
====================================================
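
When watching several GPUs during a run, it can help to reduce this table to a single number. The sketch below averages the Utilization column. It assumes the gpu-info output has exactly the layout shown above, and mean_util is a hypothetical helper name.

```shell
# mean_util: read gpu-info style output on stdin and print the mean
# utilization over all device rows. A device row starts with a numeric
# device index and ends with "<value> %".
mean_util() {
    awk '$1 ~ /^[0-9]+$/ && $NF == "%" { sum += $(NF - 1); n++ }
         END { if (n) printf "%.1f\n", sum / n }'
}

# Example using the two device rows from the table above.
mean_util <<'EOF'
Device  Model           Temperature     Utilization
0       Tesla K20m      36 C            96 %
1       Tesla K20m      34 C            92 %
EOF
# prints 94.0
```

On the node itself this would be run as `gpu-info | mean_util`.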
