**[[cluster:
===== Lammps GPU Testing (EC) =====
| + | |||
| + | * 32 cores E2660 | ||
| + | * 4 K20 GPU | ||
| + | * workstation | ||
| + | * MPICH2 flavor | ||
| + | |||
| + | |||
| + | Same tests (12 cpu cores) using lj/cut, eam, lj/expand, and morse: **AU.reduced** | ||
| + | |||
| + | CPU only 6 mins 1 secs | ||
| + | 1 GPU 1 mins 1 secs (a 5-6 times speed up) | ||
| + | 2 GPUs 1 mins 0 secs (never saw 2nd GPU used, problem set too small?) | ||
| + | |||
| + | Same tests (12 cpu cores) using a restart file and using gayberne: **GB** | ||
| + | |||
| + | CPU only 1 hour 5 mins | ||
| + | 1 GPU 5 mins and 15 secs (a 18-19 times peed up) | ||
| + | 2 GPUs 2 mins | ||
| + | |||
| + | Above results seems overall a bit slower that at other vendor, but same pattern. | ||
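
The speedups quoted above come straight from the ''Loop time'' line that LAMMPS prints at the end of each run. A minimal sketch of that comparison, assuming the CPU-only and 1-GPU logs were saved under the hypothetical names ''au_cpu.log'' and ''au_gpu.log'':

<code>
#!/bin/bash
# compare LAMMPS wall times from two saved logs (hypothetical file names)
# LAMMPS ends each run with a line like:
#   Loop time of 345.936 on 1 procs for 100000 steps with 32000 atoms
cpu=$(grep 'Loop time' au_cpu.log | tail -1 | awk '{print $4}')
gpu=$(grep 'Loop time' au_gpu.log | tail -1 | awk '{print $4}')
echo "CPU only loop time : ${cpu}s"
echo "1 GPU    loop time : ${gpu}s"
echo "Speedup            : $(awk -v c=$cpu -v g=$gpu 'BEGIN {printf "%.1fx\n", c/g}')"
</code>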
| + | |||
| + | Francis' | ||
| + | |||
| + | ^3d Lennard-Jones melt: for 10,000 steps with 32,000 atoms^^^^^^ | ||
| + | |CPU only| -np 1 | -np 6 | -np 12 | -np 24 | -np 36 | | ||
| + | |loop times| | ||
| + | |GPU only| 1xK20 | 2xK20 | 3xK20 | 4xK20 | (-np 1-4) | | ||
| + | |loop times| | ||
| + | ^3d Lennard-Jones melt: for 100,000 steps with 32,000 atoms^^^^^^ | ||
| + | |GPU only| 1xK20 | 2xK20 | 3xK20 | 4xK20 | (-np 1-4) | | ||
| + | |loop times| | ||
| + | |||
| + | * Serial' | ||
| + | * GPU's serial time matches MPI -np 24 and can be further reduced to 10s, a 3x speed up | ||
| + | |||
| + | ==== Redoing Above ==== | ||
| + | |||
| + | **10/ | ||
| + | |||
| + | Redoing the melt problem now on our own K20 hardware I get the following (observing with gpu-info that utilization runs about 20-25% on the GPU allocated) | ||
| + | |||
| + | Loop time of 345.936 on 1 procs for 100000 steps with 32000 atoms | ||
| + | |||
| + | < | ||
| + | |||
| + | # | ||
| + | # submit via 'bsub < run.gpu' | ||
| + | rm -f log.lammps melt.log | ||
| + | #BSUB -e err | ||
| + | #BSUB -o out | ||
| + | #BSUB -q mwgpu | ||
| + | #BSUB -J test | ||
| + | |||
| + | ## leave sufficient time between job submissions (30-60 secs) | ||
| + | ## the number of GPUs allocated matches -n value automatically | ||
| + | ## always reserve GPU (gpu=1), setting this to 0 is a cpu job only | ||
| + | ## reserve 6144 MB (5 GB + 20%) memory per GPU | ||
| + | ## run all processes (1< | ||
| + | |||
| + | #BSUB -n 1 | ||
| + | #BSUB -R " | ||
| + | |||
| + | # from greentail we need to recreate module env | ||
| + | export PATH=/ | ||
| + | / | ||
| + | / | ||
| + | / | ||
| + | / | ||
| + | / | ||
| + | / | ||
| + | / | ||
| + | export PATH=/ | ||
| + | export LD_LIBRARY_PATH=/ | ||
| + | / | ||
| + | / | ||
| + | / | ||
| + | / | ||
| + | / | ||
| + | / | ||
| + | |||
| + | # unique job scratch dirs | ||
| + | MYSANSCRATCH=/ | ||
| + | MYLOCALSCRATCH=/ | ||
| + | export MYSANSCRATCH MYLOCALSCRATCH | ||
| + | cd $MYSANSCRATCH | ||
| + | |||
| + | # LAMMPS | ||
| + | # GPUIDX=1 use allocated GPU(s), GPUIDX=0 cpu run only (view header au.inp) | ||
| + | export GPUIDX=1 | ||
| + | # stage the data | ||
| + | cp ~/ | ||
| + | # feed the wrapper | ||
| + | lava.mvapich2.wrapper lmp_nVidia \ | ||
| + | -c off -var GPUIDX $GPUIDX -in in.melt | ||
| + | # save results | ||
| + | cp log.lammps melt.log | ||
| + | |||
| + | |||
| + | </ | ||
| + | |||
| + | ===== Lammps GPU Testing (MW) ===== | ||
| Vendor: "There are currently two systems available, each with two 8-core Xeon E5-2670 processors, 32GB memory, 120GB SSD and two Tesla K20 GPUs. The hostnames are master and node2. | Vendor: "There are currently two systems available, each with two 8-core Xeon E5-2670 processors, 32GB memory, 120GB SSD and two Tesla K20 GPUs. The hostnames are master and node2. | ||