cluster:116 [2013/08/12 17:45] hmeij
==== CPU-HPC ====
With hyperthreading enabled, the 5 nodes provide 160 cores.
Since there is no scheduler, you need to set up your environment and execute your program yourself.
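Because runs are launched by hand, something like the following has to happen in each shell before invoking ''mpirun_rsh''. This is a minimal sketch; the ''/usr/local/mvapich2'' prefix is an assumption, not the cluster's actual install path.

```shell
# Minimal interactive setup sketch (no scheduler); the mvapich2 prefix
# below is an assumed location -- adjust to the real install prefix.
export MPI_HOME=/usr/local/mvapich2
export PATH=$MPI_HOME/bin:$PATH
export LD_LIBRARY_PATH=$MPI_HOME/lib:$LD_LIBRARY_PATH

# mpirun_rsh reads its target nodes from a hostfile, one hostname per line.
echo "n34" > hostfile
```

The hostfile here lists only ''n34'', matching the single-node example further down; add one line per node for multi-node runs.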
LAMMPS and Amber were compiled against mvapich2. They should be run with ''mpirun_rsh'', as in the examples below.
[[cluster:
+ | |||
+ | Sharptail example. | ||
+ | |||
+ | < | ||
+ | |||
+ | [hmeij@sharptail sharptail]$ cat hostfile | ||
+ | n34 | ||
+ | |||
+ | [hmeij@sharptail sharptail]$ mpirun_rsh -ssh -hostfile ~/ | ||
+ | -np 12 lmp_nVidia -sf gpu -c off -v g 2 -v x 32 -v y 32 -v z 64 -v t 100 < \ | ||
+ | ~/ | ||
+ | |||
+ | unloading gcc module | ||
+ | LAMMPS (31 May 2013) | ||
+ | Lattice spacing in x,y,z = 1.6796 1.6796 1.6796 | ||
+ | Created orthogonal box = (0 0 0) to (53.7471 53.7471 107.494) | ||
+ | 2 by 2 by 3 MPI processor grid | ||
+ | Created 262144 atoms | ||
+ | |||
+ | -------------------------------------------------------------------------- | ||
+ | - Using GPGPU acceleration for lj/ | ||
+ | - with 6 proc(s) per device. | ||
+ | -------------------------------------------------------------------------- | ||
+ | GPU 0: Tesla K20m, 2496 cores, 4.3/4.7 GB, 0.71 GHZ (Mixed Precision) | ||
+ | GPU 1: Tesla K20m, 2496 cores, 4.3/0.71 GHZ (Mixed Precision) | ||
+ | -------------------------------------------------------------------------- | ||
+ | |||
+ | Initializing GPU and compiling on process 0...Done. | ||
+ | Initializing GPUs 0-1 on core 0...Done. | ||
+ | Initializing GPUs 0-1 on core 1...Done. | ||
+ | Initializing GPUs 0-1 on core 2...Done. | ||
+ | Initializing GPUs 0-1 on core 3...Done. | ||
+ | Initializing GPUs 0-1 on core 4...Done. | ||
+ | Initializing GPUs 0-1 on core 5...Done. | ||
+ | |||
+ | Setting up run ... | ||
+ | Memory usage per processor = 5.83686 Mbytes | ||
+ | Step Temp E_pair E_mol TotEng Press | ||
+ | | ||
+ | | ||
+ | Loop time of 0.431599 on 12 procs for 100 steps with 262144 atoms | ||
+ | |||
+ | Pair time (%) = 0.255762 (59.2592) | ||
+ | Neigh time (%) = 4.80811e-06 (0.00111402) | ||
+ | Comm time (%) = 0.122923 (28.481) | ||
+ | Outpt time (%) = 0.00109257 (0.253146) | ||
+ | Other time (%) = 0.051816 (12.0056) | ||
+ | |||
+ | Nlocal: | ||
+ | Histogram: 2 3 3 0 0 0 0 2 1 1 | ||
+ | Nghost: | ||
+ | Histogram: 2 2 0 0 0 0 0 0 3 5 | ||
+ | Neighs: | ||
+ | Histogram: 12 0 0 0 0 0 0 0 0 0 | ||
+ | |||
+ | Total # of neighbors = 0 | ||
+ | Ave neighs/atom = 0 | ||
+ | Neighbor list builds = 5 | ||
+ | Dangerous builds = 0 | ||
+ | |||
+ | |||
+ | --------------------------------------------------------------------- | ||
+ | GPU Time Info (average): | ||
+ | --------------------------------------------------------------------- | ||
+ | Neighbor (CPU): | ||
+ | GPU Overhead: | ||
+ | Average split: | ||
+ | Threads / atom: 4. | ||
+ | Max Mem / Proc: 31.11 MB. | ||
+ | CPU Driver_Time: | ||
+ | CPU Idle_Time: | ||
+ | --------------------------------------------------------------------- | ||
+ | |||
+ | |||
+ | </ | ||
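The "6 proc(s) per device" line in the log is just the MPI rank count divided across the GPUs on the node; a trivial sketch of that arithmetic:

```shell
# 12 MPI ranks (-np 12) shared over 2 GPUs (-v g 2) in the run above.
NP=12
NGPU=2
echo "$((NP / NGPU)) proc(s) per device"
# prints: 6 proc(s) per device
```

Keeping ranks per GPU at a small multiple like this is what lets both K20m cards stay busy while the CPU side handles neighbor lists and communication.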
+ | |||
+ | |||
+ | [[cluster: | ||
+ | |||
+ | Note: ran out of time to get an example running but it should follow the LAMMPS approach of above pretty closely. | ||
+ | |||
+ | Here is quick Amber example | ||
+ | |||
+ | < | ||
+ | |||
+ | [hmeij@sharptail nucleosome]$ export AMBER_HOME=/ | ||
+ | |||
+ | # find a GPU ID with gpu-info then expose that GPU to pmemd | ||
+ | [hmeij@sharptail nucleosome]$ export CUDA_VISIBLE_DEVICES=1 | ||
+ | |||
+ | # you only need one cpu core | ||
+ | [hmeij@sharptail nucleosome]$ mpirun_rsh -ssh -hostfile ~/ | ||
+ | / | ||
+ | |||
+ | </ | ||
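The "find a GPU ID with gpu-info" step can be scripted so the least-loaded device is chosen automatically. This is a hypothetical sketch: the two-column "id used-MB" sample below is a stand-in for real ''gpu-info'' or ''nvidia-smi'' output, not its actual format.

```shell
# Hypothetical sketch: pick the GPU with the least memory in use and
# expose only that device to pmemd via CUDA_VISIBLE_DEVICES.
# sample_usage stands in for real gpu-info/nvidia-smi output (assumed
# "id used-MB" format; the numbers here are made up for illustration).
sample_usage() {
cat <<'EOF'
0 3900
1 120
EOF
}

# Sort numerically on the used-memory column, take the idlest device id.
FREE_GPU=$(sample_usage | sort -k2 -n | head -1 | awk '{print $1}')
export CUDA_VISIBLE_DEVICES=$FREE_GPU
echo "using GPU $CUDA_VISIBLE_DEVICES"
# prints: using GPU 1
```

With ''CUDA_VISIBLE_DEVICES'' set this way, pmemd sees only the chosen card, which is the same effect as the manual ''export CUDA_VISIBLE_DEVICES=1'' above.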
NAMD was compiled with the built-in multi-node networking capabilities,