cluster:116

Differences

This shows you the differences between two versions of the page.

cluster:116 [2013/07/11 10:41]
hmeij [GPU-HPC]
cluster:116 [2014/02/04 13:57] (current)
hmeij
Line 1: Line 1:
 \\
 **[[cluster:0|Back]]**
 +
 +Since the deployment of sharptail, the information below is out of date: /home is now the same across the entire HPCC and is served out by sharptail.
 +
 + --- //[[hmeij@wesleyan.edu|Meij, Henk]] 2014/02/04 13:56//
  
 ===== Sharptail Cluster =====
Line 39: Line 43:
 ==== /sanscratch ====
  
-Sharptail will provide the users (and scheduler) with another 5 TB scratch file system.  During this period it is only provided to the sharptail nodes (n33-n37). In the future it will provide this file system to all nodes except greentail nodes (n1-n32).
 +Sharptail will provide the users (and scheduler) with another 5 TB scratch file system.  It is only provided to the sharptail nodes (n33-n37); all other nodes will have /sanscratch provided by greentail.  You can follow the progress of your job by looking into the /sanscratch/JOBPID directory on either greentail or sharptail.
  
   * Please offload as much IO from /home by staging your jobs in /sanscratch
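For example, a job can stage its work in /sanscratch and copy the results back to /home when it finishes.  A minimal sketch, in which the directory name MYJOB, the program name and the file names are only placeholders (when the scheduler stages a job for you, the directory is /sanscratch/JOBPID):

<code>
# stage the job in /sanscratch to keep the heavy IO off /home
# MYJOB is a placeholder; a scheduler-staged job would use /sanscratch/JOBPID
mkdir -p /sanscratch/MYJOB
cp ~/project/input.dat /sanscratch/MYJOB/

# run in the scratch directory
cd /sanscratch/MYJOB
./my_program < input.dat > output.log

# copy the results back to /home and clean up
cp output.log ~/project/
rm -rf /sanscratch/MYJOB
</code>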
Line 78: Line 82:
  
 In both cases you do not need to target any specific core; the operating system will handle that part of the scheduling.
 +
 +==== NOTE ====
 +
 +
 +----
 +
 +The instructions below are obsolete; these resources are now available via the scheduler.
 +
 +Please read [[cluster:119|Submitting GPU Jobs]]
 +
 + --- //[[hmeij@wesleyan.edu|Meij, Henk]] 2013/08/21 10:46//
 +
 +----
  
  
 ==== CPU-HPC ====
  
-With hyperthreading on the 5 nodes, it provides for 160 cores.  We need to reserve 20 cores for the GPUs (one per GPU), and lets reserve another 20 cores for the OS (per node).  That still leaves 120 cores for regular jobs like you are used to on greentail.  These 120 cores (24 per node) will show up later as a new queue on greentail/swallowtail; one that is fit for jobs that need much memory. On average 256 gb per node minus 20 gb for 4 GPUs minus 20 gb for OS leaves 5.6 gb ''per core''
 +With hyperthreading on, the 5 nodes provide 160 cores.  We need to reserve 20 cores for the GPUs (one per GPU, 4 per node), and let's reserve another 20 cores for the OS (4 per node).  That still leaves 120 cores for regular jobs like you are used to on greentail.  These 120 cores (24 per node) will show up later as a new queue on greentail/swallowtail; one that is fit for jobs that need much memory. On average, 256 gb per node minus 20 gb for the 4 GPUs minus 20 gb for the OS leaves 5.6 gb ''per core''.
  
 Since there is no scheduler, you need to set up your environment and execute your program yourself.  Here is an example of a program that normally runs on the imw queue.  If your program involves MPI you need to be a bit up to speed on what the lava wrapper actually does for you.
Line 159: Line 176:
 Testing of GPUs at vendor sites may help give you an idea of how to run GPU-compiled code.
  
 +LAMMPS and Amber were compiled against mvapich2. They should be run with "mpirun_rsh -ssh -hostfile /path/to/hostfile -np N other_program_options".
 +
 +[[cluster:109|Lammps GPU Testing]] ... may give you some ideas
 +
 +A sharptail example follows.  The hostfile contains just one line with a single node name, which allows LAMMPS to pick any idle GPU it finds; that is a potential clash problem.  The link above shows how to target GPUs by ID.
 +
 +<code>                                                                           
 +                                                                       
 +[hmeij@sharptail sharptail]$ cat hostfile                                                      
 +n34     
 +                                                                                       
 +[hmeij@sharptail sharptail]$ mpirun_rsh -ssh -hostfile ~/sharptail/hostfile \
 +-np 12 lmp_nVidia -sf gpu -c off -v g 2 -v x 32 -v y 32 -v z 64 -v t 100 <  \
 +~/sharptail/in.lj.gpu
 +     
 +unloading gcc module                                                                                                       
 +LAMMPS (31 May 2013)                                                  
 +Lattice spacing in x,y,z = 1.6796 1.6796 1.6796                                                     
 +Created orthogonal box = (0 0 0) to (53.7471 53.7471 107.494)                                
 +  2 by 2 by 3 MPI processor grid                                                                         
 +Created 262144 atoms                                                                                       
 +
 +--------------------------------------------------------------------------
 +- Using GPGPU acceleration for lj/cut:                                    
 +-  with 6 proc(s) per device.                                             
 +--------------------------------------------------------------------------
 +GPU 0: Tesla K20m, 2496 cores, 4.3/4.7 GB, 0.71 GHZ (Mixed Precision)     
 +GPU 1: Tesla K20m, 2496 cores, 4.3/4.7 GB, 0.71 GHZ (Mixed Precision)
 +--------------------------------------------------------------------------
 +
 +Initializing GPU and compiling on process 0...Done.
 +Initializing GPUs 0-1 on core 0...Done.
 +Initializing GPUs 0-1 on core 1...Done.
 +Initializing GPUs 0-1 on core 2...Done.
 +Initializing GPUs 0-1 on core 3...Done.
 +Initializing GPUs 0-1 on core 4...Done.
 +Initializing GPUs 0-1 on core 5...Done.
 +
 +Setting up run ...
 +Memory usage per processor = 5.83686 Mbytes
 +Step Temp E_pair E_mol TotEng Press
 +       0         1.44   -6.7733676            0   -4.6133759   -5.0196742
 +     100   0.75875604   -5.7604958            0    -4.622366   0.19306017
 +Loop time of 0.431599 on 12 procs for 100 steps with 262144 atoms
 +
 +Pair  time (%) = 0.255762 (59.2592)
 +Neigh time (%) = 4.80811e-06 (0.00111402)
 +Comm  time (%) = 0.122923 (28.481)
 +Outpt time (%) = 0.00109257 (0.253146)
 +Other time (%) = 0.051816 (12.0056)
 +
 +Nlocal:    21845.3 ave 22013 max 21736 min
 +Histogram: 2 3 3 0 0 0 0 2 1 1
 +Nghost:    15524 ave 15734 max 15146 min
 +Histogram: 2 2 0 0 0 0 0 0 3 5
 +Neighs:    0 ave 0 max 0 min
 +Histogram: 12 0 0 0 0 0 0 0 0 0
 +
 +Total # of neighbors = 0
 +Ave neighs/atom = 0
 +Neighbor list builds = 5
 +Dangerous builds = 0
 +
 +
 +---------------------------------------------------------------------
 +      GPU Time Info (average):
 +---------------------------------------------------------------------
 +Neighbor (CPU):  0.0041 s.
 +GPU Overhead:    0.0429 s.
 +Average split:   1.0000.
 +Threads / atom:  4.
 +Max Mem / Proc:  31.11 MB.
 +CPU Driver_Time: 0.0405 s.
 +CPU Idle_Time:   0.2199 s.
 +---------------------------------------------------------------------
 +
 +
 +</code>
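To avoid such a clash, one option (not taken from the page linked above, but the same generic CUDA mechanism the Amber example below uses) is to limit which GPUs are visible to the run.  A minimal sketch, assuming the input script's ''g'' variable sets how many GPUs to use:

<code>
# a sketch only: pin this run to GPU 1, leaving GPU 0 to other jobs
# (check which GPUs are busy first, for example with gpu-info)
# the variable is given on the mpirun_rsh command line so that it
# reaches the remote ranks; exporting it in the login shell may not
mpirun_rsh -ssh -hostfile ~/sharptail/hostfile -np 12 \
CUDA_VISIBLE_DEVICES=1 lmp_nVidia -sf gpu -c off -v g 1 -v x 32 -v y 32 -v z 64 -v t 100 < \
~/sharptail/in.lj.gpu
</code>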
 +
 +
 +[[cluster:111|Amber GPU Testing]] ... may also give you some ideas
 +
 +Note: we ran out of time to get a full example running, but it should follow the LAMMPS approach above pretty closely.  The binary is /cm/shared/apps/amber/amber12/bin/pmemd.cuda.MPI
 +
 +Here is a quick Amber example:
 +
 +<code>
 +
 +[hmeij@sharptail nucleosome]$ export AMBER_HOME=/cm/shared/apps/amber/amber12
 +
 +# find a GPU ID with gpu-info then expose that GPU to pmemd
 +[hmeij@sharptail nucleosome]$ export CUDA_VISIBLE_DEVICES=1
 +
 +# you only need one cpu core
 +[hmeij@sharptail nucleosome]$ mpirun_rsh -ssh -hostfile ~/sharptail/hostfile -np 1 \
 +/cm/shared/apps/amber/amber12/bin/pmemd.cuda.MPI -O -o mdout.1K10 -inf mdinfo.1K10 -x mdcrd.1K10 -r restrt.1K10 -ref inpcrd &
 +
 +</code>
  
-[[cluster:109|Lammps GPU Testing]] 
  
-[[cluster:111|Amber GPU Testing]] 
  
 +NAMD was compiled with built-in multi-node networking support, including ibverbs.
  
-Here is an example for NAMD:
 +An example of running NAMD is below.
  
 <code>