===== Lammps GPU Testing (EC) =====

    CPU only 1 hour 5 mins
    1 GPU 5 mins and 15 secs (an 18-19 times speed up)
    2 GPUs 2 mins

The above results seem overall a bit slower than at the other vendor, but show the same pattern.
  
Francis's Melt problem set:
  
^3d Lennard-Jones melt: for 10,000 steps with 32,000 atoms^^^^^^
|CPU only|  -np 1  |  -np 6  |  -np 12  |  -np 24  |  -np 36  |
|loop times|  329s  |  63s  |  39s  |  29s  |  45s  |
|GPU only|  1xK20  |  2xK20  |  3xK20  |  4xK20  |  (-np 1-4)  |
|loop times|  28s  |  16s  |  11s  |  10s  |    |

^3d Lennard-Jones melt: for 100,000 steps with 32,000 atoms^^^^^^
|GPU only|  1xK20  |  2xK20  |  3xK20  |  4xK20  |  (-np 1-4)  |
|loop times|  274s  |  162s  |  120s  |  98s  |    |
  
  * The serial CPU time of 329s is reduced to 29s with MPI (-np 24), an 11x speed up
  * A single GPU (28s) matches MPI -np 24, and four GPUs reduce that further to 10s, roughly another 3x speed up (see the quick arithmetic below)
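
The speed up factors follow directly from the loop times in the tables above; as a quick sanity check they can be recomputed with plain shell arithmetic (a minimal sketch, nothing site-specific assumed):

<code>
# speed up = slower loop time / faster loop time (values taken from the tables above)
echo "scale=1; 329/29" | bc   # serial CPU vs MPI -np 24  -> ~11.3x
echo "scale=1; 329/28" | bc   # serial CPU vs 1xK20       -> ~11.8x
echo "scale=1; 28/10"  | bc   # 1xK20 vs 4xK20            -> 2.8x
</code>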
  
==== Redoing Above ====

**10/16/2013**

Redoing the melt problem on our own K20 hardware I get the following (observing with gpu-info that utilization runs at about 20-25% on the allocated GPU):

Loop time of 345.936 on 1 procs for 100000 steps with 32000 atoms

<code>
#!/bin/bash
# submit via 'bsub < run.gpu'
#BSUB -e err
#BSUB -o out
#BSUB -q mwgpu
#BSUB -J test

## leave sufficient time between job submissions (30-60 secs)
## the number of GPUs allocated matches the -n value automatically
## always reserve a GPU (gpu=1); setting this to 0 makes it a cpu-only job
## reserve 6144 MB (5 GB + 20%) memory per GPU
## run all processes (1<=n<=4) on the same node (hosts=1)

#BSUB -n 1
#BSUB -R "rusage[gpu=1:mem=6144],span[hosts=1]"

# remove leftovers from a previous run (kept below the #BSUB block so
# bsub still parses the embedded options above)
rm -f log.lammps melt.log

# from greentail we need to recreate the module env
export PATH=/home/apps/bin:/cm/local/apps/cuda50/libs/304.54/bin:\
/cm/shared/apps/cuda50/sdk/5.0.35/bin/linux/release:/cm/shared/apps/lammps/cuda/2013-01-27/:\
/cm/shared/apps/amber/amber12/bin:/cm/shared/apps/namd/ibverbs-smp-cuda/2013-06-02/:\
/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:\
/cm/shared/apps/cuda50/toolkit/5.0.35/bin:\
/cm/shared/apps/cuda50/sdk/5.0.35/bin/linux/release:/cm/shared/apps/cuda50/libs/current/bin:\
/cm/shared/apps/cuda50/toolkit/5.0.35/open64/bin:/cm/shared/apps/mvapich2/gcc/64/1.6/bin:\
/cm/shared/apps/mvapich2/gcc/64/1.6/sbin
export PATH=/share/apps/bin:$PATH
export LD_LIBRARY_PATH=/cm/local/apps/cuda50/libs/304.54/lib64:\
/cm/shared/apps/cuda50/toolkit/5.0.35/lib64:/cm/shared/apps/amber/amber12/lib:\
/cm/shared/apps/amber/amber12/lib64:/cm/shared/apps/namd/ibverbs-smp-cuda/2013-06-02/:\
/cm/shared/apps/cuda50/toolkit/5.0.35/lib64:/cm/shared/apps/cuda50/libs/current/lib64:\
/cm/shared/apps/cuda50/toolkit/5.0.35/open64/lib:\
/cm/shared/apps/cuda50/toolkit/5.0.35/extras/CUPTI/lib:\
/cm/shared/apps/mvapich2/gcc/64/1.6/lib

# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH
cd $MYSANSCRATCH

# LAMMPS
# GPUIDX=1 use allocated GPU(s), GPUIDX=0 cpu run only (view header of the input file, here in.melt)
export GPUIDX=1
# stage the data (input files for the melt problem)
cp ~/gpu_testing/fstarr/lj/* .
# feed the wrapper
lava.mvapich2.wrapper lmp_nVidia \
-c off -var GPUIDX $GPUIDX -in in.melt
# save results
cp log.lammps melt.log ~/gpu_testing/fstarr/lj/
</code>
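
For reference, a minimal session around this script could look like the sketch below. It assumes the script is saved as run.gpu (as its own comment suggests); bsub and bjobs are standard LSF commands, and gpu-info is the site tool mentioned above.

<code>
# submit the script above (assumed saved as run.gpu) to the mwgpu queue
bsub < run.gpu

# watch the job and the allocated K20
bjobs -u $USER     # job state (PEND/RUN/DONE)
gpu-info           # site tool; reported ~20-25% utilization during this run

# afterwards, pull the loop time out of the saved log
grep "Loop time" ~/gpu_testing/fstarr/lj/melt.log
</code>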
  
===== Lammps GPU Testing (MW) =====