cluster:182 [2019/08/12 14:41] hmeij07 [Amber]\\
cluster:182 [2019/08/12 16:45] hmeij07 [Lammps]
==== Amber ====
The RTX compute node had only one GPU; the other nodes had four GPUs each. In each run the number of MPI threads requested equaled the number of GPUs involved. A sample script is at the bottom of the page.
  * [DPFP] - Double Precision Forces, 64-bit Fixed Point Accumulation.
^ Precision ^ ns/day ^ ^
| DPFP | 5.21| 18.35|
| SXFP | 11.82| |
| SFFP | 11.91| |
Like the last testing outcome, in the SFFP precision mode it is best to run four individual jobs, one per GPU (mpi=1, gpu=1). Best performance is the P100 at 47.64 vs the RTX at 39.69 ns/day per node. The T4 runs about 1/3 as fast and really falters in DPFP precision mode, but in the (experimental) SXFP precision mode the T4 makes up some of that performance.
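As a quick consistency check (assuming the SFFP table entry is the per-GPU rate; that reading is an inference from this page, not stated outright), the quoted per-node P100 figure matches four concurrent single-GPU jobs:

```shell
# Hypothetical check: 4 single-GPU SFFP jobs at 11.91 ns/day each
# should aggregate to the quoted 47.64 ns/day per P100 node.
awk 'BEGIN { printf "%.2f ns/day per node\n", 4 * 11.91 }'
# prints: 47.64 ns/day per node
```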
Can't complain about utilization rates.\\
Amber mpi=4 gpu=4\\

  [heme@login1 amber16]$ ssh node7 ./
  id,
  0, Tesla P100-PCIE-16GB,
  1, Tesla P100-PCIE-16GB,
  2, Tesla P100-PCIE-16GB,
  3, Tesla P100-PCIE-16GB,
+ | |||
==== Lammps ====
+ | |||
+ | Precision for GPU calculations | ||
+ | |||
  * DD -D_DOUBLE_DOUBLE
  * SD -D_SINGLE_DOUBLE
  * SS -D_SINGLE_SINGLE
+ | |||
+ | |||
+ | |||
^ tau/day ^ ^ ^
| DD | 856669.660| |
| SD | 981897.313| |
| SS | 1050796.986| |
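Relative to full double precision, the mixed- and single-precision gains on this benchmark work out as follows (a quick check computed from the table above):

```shell
# Speedup of SD and SS over DD on the colloid benchmark (tau/day from table).
awk 'BEGIN {
    dd = 856669.660; sd = 981897.313; ss = 1050796.986
    printf "SD/DD = %.2f\n", sd / dd   # prints: SD/DD = 1.15
    printf "SS/DD = %.2f\n", ss / dd   # prints: SS/DD = 1.23
}'
```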
==== Scripts ====

All three software applications were compiled within the default environment with CUDA 10.1.

Currently Loaded Modules:\\
1) GCCcore/\\
2) zlib/\\
3) binutils/\\

Follow\\
https://

  * Amber

<code>

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --nodelist=node7
#SBATCH --job-name="
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:
#SBATCH --exclusive

# NSTEP = 40000
rm -f restrt.1K10
mpirun --oversubscribe -x LD_LIBRARY_PATH -np 1 \
  -H localhost \
  ~/
  -inf mdinfo.1K10 -x mdcrd.1K10 -r restrt.1K10 -ref inpcrd
</code>

  * Lammps

<code>

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --nodelist=node5
#SBATCH --job-name="
#SBATCH --gres=gpu:
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive

# RTX
mpirun --oversubscribe -x LD_LIBRARY_PATH -np 1 \
  -H localhost \
  ~/
  -in in.colloid > rtx-1:1

[heme@login1 lammps-5Jun19]$ squeue
  JOBID PARTITION
   2239    normal

[heme@login1 lammps-5Jun19]$ ssh node5 ./gpu-info
id,
0, Quadro RTX 6000, 50, 186 MiB, 24004 MiB, 51 %, 0 %

</code>
\\
**[[cluster: