===== Lammps GPU Testing (EC) =====

  * 32 cores E2660
  * 4 K20 GPUs
  * workstation
  * MPICH2 flavor

Same tests (12 CPU cores) using lj/cut, eam, lj/expand, and morse: **AU.reduced**

    CPU only: 6 min 1 sec
    1 GPU: 1 min 1 sec (a 5-6 times speed up)
    2 GPUs: 1 min 0 sec (never saw the 2nd GPU used; problem set too small?)

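Switching these runs between CPU-only and GPU is, in essence, the same GPUIDX variable switch used in the submit script further down. The following is only a rough sketch: direct use of mpirun on 12 cores and the log file names are assumptions, and on the queue the lava.mvapich2.wrapper shown below handles the MPI launch instead.

<code>
# CPU-only baseline on 12 cores: GPUIDX=0 leaves GPU acceleration off
mpirun -np 12 lmp_nVidia -var GPUIDX 0 -in au.inp > au.cpu.log 2>&1

# same 12 MPI ranks sharing one K20: GPUIDX=1 turns GPU acceleration on
# in the input file header
mpirun -np 12 lmp_nVidia -var GPUIDX 1 -in au.inp > au.gpu.log 2>&1
</code>
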
Same tests (12 CPU cores) using a restart file and the gayberne pair style: **GB**

    CPU only: 1 hour 5 mins
    1 GPU: 5 mins 15 secs (an 18-19 times speed up)
    2 GPUs: 2 mins

The results above seem overall a bit slower than at the other vendor, but show the same pattern.

Francis's melt problem set:

^3d Lennard-Jones melt: for 10,000 steps with 32,000 atoms^^^^^^
|CPU only|  -np 1  |  -np 6  |  -np 12  |  -np 24  |  -np 36  |
|loop times|  329s  |  63s  |  39s  |  29s  |  45s  |
|GPU only|  1xK20  |  2xK20  |  3xK20  |  4xK20  |  (-np 1-4)  |
|loop times|  28s  |  16s  |  11s  |  10s  |    |
^3d Lennard-Jones melt: for 100,000 steps with 32,000 atoms^^^^^^
|GPU only|  1xK20  |  2xK20  |  3xK20  |  4xK20  |  (-np 1-4)  |
|loop times|  274s  |  162s  |  120s  |  98s  |    |

  * The serial CPU time of 329s drops to 29s with MPI at -np 24, an 11x speed up
  * A single GPU (28s) matches MPI at -np 24, and four GPUs reduce this further to 10s, roughly a 3x speed up

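For completeness, the CPU scaling numbers above can be collected in a single pass with a small loop. This is a minimal sketch only, assuming an interactive session on the node, the in.melt input, and mpirun from the MVAPICH2 module on the PATH; the log file names are illustrative.

<code>
#!/bin/bash
# scan MPI rank counts for the CPU-only melt benchmark
# GPUIDX=0 keeps GPU acceleration disabled in the input file
for np in 1 6 12 24 36; do
    mpirun -np $np lmp_nVidia -var GPUIDX 0 -in in.melt \
        > melt.cpu.np${np}.log 2>&1
    # LAMMPS reports the wall time as "Loop time of ..." near the end
    grep "Loop time" melt.cpu.np${np}.log
done
</code>
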
==== Redoing Above ====

**10/16/2013**

Redoing the melt problem on our own K20 hardware, I get the following (observing with gpu-info that utilization runs at about 20-25% on the allocated GPU):

    Loop time of 345.936 on 1 procs for 100000 steps with 32000 atoms

<code>
#!/bin/bash
# submit via 'bsub < run.gpu'
#BSUB -e err
#BSUB -o out
#BSUB -q mwgpu
#BSUB -J test

## leave sufficient time between job submissions (30-60 secs)
## the number of GPUs allocated matches the -n value automatically
## always reserve a GPU (gpu=1); setting gpu=0 makes it a cpu-only job
## reserve 6144 MB (5 GB + 20%) memory per GPU
## run all processes (1<=n<=4) on the same node (hosts=1)

#BSUB -n 1
#BSUB -R "rusage[gpu=1:mem=6144],span[hosts=1]"

# clean out old logs; kept below the #BSUB lines so LSF reads all directives
rm -f log.lammps melt.log

# from greentail we need to recreate module env
export PATH=/home/apps/bin:/cm/local/apps/cuda50/libs/304.54/bin:\
/cm/shared/apps/cuda50/sdk/5.0.35/bin/linux/release:/cm/shared/apps/lammps/cuda/2013-01-27/:\
/cm/shared/apps/amber/amber12/bin:/cm/shared/apps/namd/ibverbs-smp-cuda/2013-06-02/:\
/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/sbin:\
/usr/sbin:/cm/shared/apps/cuda50/toolkit/5.0.35/bin:\
/cm/shared/apps/cuda50/sdk/5.0.35/bin/linux/release:/cm/shared/apps/cuda50/libs/current/bin:\
/cm/shared/apps/cuda50/toolkit/5.0.35/open64/bin:/cm/shared/apps/mvapich2/gcc/64/1.6/bin:\
/cm/shared/apps/mvapich2/gcc/64/1.6/sbin
export PATH=/share/apps/bin:$PATH
export LD_LIBRARY_PATH=/cm/local/apps/cuda50/libs/304.54/lib64:\
/cm/shared/apps/cuda50/toolkit/5.0.35/lib64:/cm/shared/apps/amber/amber12/lib:\
/cm/shared/apps/amber/amber12/lib64:/cm/shared/apps/namd/ibverbs-smp-cuda/2013-06-02/:\
/cm/shared/apps/cuda50/toolkit/5.0.35/lib64:/cm/shared/apps/cuda50/libs/current/lib64:\
/cm/shared/apps/cuda50/toolkit/5.0.35/open64/lib:\
/cm/shared/apps/cuda50/toolkit/5.0.35/extras/CUPTI/lib:\
/cm/shared/apps/mvapich2/gcc/64/1.6/lib

# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH
cd $MYSANSCRATCH

# LAMMPS
# GPUIDX=1 use allocated GPU(s), GPUIDX=0 cpu run only (see header of au.inp)
export GPUIDX=1
# stage the data (copy the input files into the scratch dir)
cp ~/gpu_testing/fstarr/lj/* .
# feed the wrapper
lava.mvapich2.wrapper lmp_nVidia \
-c off -var GPUIDX $GPUIDX -in in.melt
# save results
cp log.lammps melt.log ~/gpu_testing/fstarr/lj/
</code>
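
A quick usage example, putting the pieces together. bsub and bjobs are standard LSF commands; running plain nvidia-smi on the execution host as an alternative to gpu-info is an assumption about what is installed there.

<code>
# submit the script above and check its state
bsub < run.gpu
bjobs -u $USER

# once the job runs, log in to the execution host listed by bjobs and
# spot-check utilization; the 20-25% figure quoted above came from gpu-info
gpu-info
nvidia-smi -l 30
</code>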
  
===== Lammps GPU Testing (MW) =====