  
  * buy a single rack and test locally, start small (will future racks be compatible?)

==== Yale Qs ====

We are tasked with getting GPU HPC going at Wesleyan and are trying to gain insight into the project. If you acquired a GPU HPC ...

  * What was the most important design element of the cluster?
  * What factor(s) settled the CPU to GPU ratio?
  * Was single or double precision peak performance more important, or neither?
  * What was the software suite in mind (commercial, open source, or custom code GPU "enabled")?
  * How did you reach out to/educate users on the aspects of GPU computing?
  * What was the impact on the users (recoding, recompiling)?
  * Was the expected computational speedup realized?
  * Were the PGI Accelerator compilers leveraged? If so, what were the results?
  * Do users compile with nvcc?
  * Does the scheduler have a resource for idle GPUs so they can be reserved?
  * How are the GPUs exposed/assigned to the jobs the scheduler submits? (see the sketch after this list)
  * Do you allow multiple serial jobs to access the same GPU? Or one parallel job to use multiple GPUs?
  * Can parallel jobs access multiple GPUs across nodes?
  * Any experiences with pmemd.cuda.MPI (part of Amber)?
  * Which MPI flavor is used most in regards to GPU computing?
  * Do you leverage the CPU side of the GPU HPC? For example, if there are 16 GPUs and 64 CPU cores on a cluster, do you allow 48 standard jobs on the idle cores (assuming the maximum of 16 serial GPU jobs)?
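
To make the exposure/assignment question concrete: a minimal sketch, assuming the scheduler sets CUDA_VISIBLE_DEVICES per job (the usual way scheduler GPU integrations hand out devices), so a job only sees and enumerates the GPUs it was given. The file name and build line are illustrative, not part of any quoted configuration.

<code c>
// gpu_probe.cu -- sketch: what a batch job sees when the scheduler has set
// CUDA_VISIBLE_DEVICES for it (assumption: the scheduler, not the user,
// exports that variable per job).
//
// Build (illustrative): nvcc -o gpu_probe gpu_probe.cu

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(void) {
    const char *visible = getenv("CUDA_VISIBLE_DEVICES");
    printf("CUDA_VISIBLE_DEVICES = %s\n", visible ? visible : "(not set)");

    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        fprintf(stderr, "no usable GPU visible to this job\n");
        return 1;
    }
    printf("GPUs visible to this job: %d\n", count);

    // Device 0 below is the first device the scheduler exposed to this job,
    // not necessarily physical GPU 0 on the node.
    cudaSetDevice(0);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("using: %s\n", prop.name);
    return 0;
}
</code>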

Notes 04/01/2012 ConfCall

  * Applications drive the CPU-to-GPU ratio and most will be 1-to-1, certainly not larger than 1-to-3
  * Users did not share GPUs but could obtain more than one, always on the same node
  * Experimental setup with 36 GB/node, dual 8-core chips
  * Nothing larger than that memory-wise, as the CPU and GPU HPC work environments were not mixed
  * No raw code development
  * Speedups were hard to quantify
  * PGI Accelerator was used because it is needed with any Fortran code (Note!)
  * Double precision was most important in the scientific applications
  * MPI flavor was OpenMPI; others (including MVAPICH) showed no advantages
  * Book: Programming Massively Parallel Processors: A Hands-on Approach, Second Edition, by David B. Kirk and Wen-mei W. Hwu (Dec 28, 2012)
    * Has examples of how to expose GPUs across nodes (a sketch follows this list)
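
Along those lines, a minimal sketch of one parallel job spreading its MPI ranks over several GPUs on and across nodes: it assumes one rank per GPU and that a simple modulo rule is good enough to divide the local devices among the ranks on a node (reasonable with OpenMPI's default mapping, but an assumption, not a description of Yale's setup). The build and run lines are one common way to do it, not a prescription.

<code c>
// rank2gpu.cu -- sketch: one MPI rank per GPU, possibly spanning nodes.
// Assumes ranks on the same node can share out the local GPUs with a
// modulo rule; with OpenMPI one could instead read the
// OMPI_COMM_WORLD_LOCAL_RANK environment variable for the local rank.
//
// Build (one common way): nvcc -ccbin mpicxx rank2gpu.cu -o rank2gpu
// Run   (illustrative):   mpirun -np 6 -npernode 3 ./rank2gpu

#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len = 0;
    MPI_Get_processor_name(host, &len);

    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);

    // Each rank picks "its" GPU on its own node; cudaSetDevice only sees
    // devices local to that node (or those in CUDA_VISIBLE_DEVICES).
    int dev = (ngpu > 0) ? rank % ngpu : -1;
    if (dev >= 0) cudaSetDevice(dev);

    printf("rank %d/%d on %s -> GPU %d of %d\n", rank, size, host, dev, ngpu);

    MPI_Finalize();
    return 0;
}
</code>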
  
==== ConfCall & Quote: AC ====
  
^  Topic^Description  ^
|  General| 2 CPUs (16 cores), 3 GPUs (7,500 cuda cores), 32 gb ram/node|
|  Head Node| None|
|  Nodes|1x4U Rackmountable Chassis, 2xXeon E5-2660 2.20 GHz 20MB Cache 8 cores (16 cores/node), Romley series|
|  | 5x upgrade to 64 GB per node|
  
  * At full load 5,900 Watts and 20,131 BTUs/hour
  
  * 2% more expensive than the "benchmark option" (as described above with Upgrades), else identical
    * With lifetime technical support
    * solid state drives on compute nodes
    * 12 TB local storage
  
Then
  * 36 port FDR switch replaced with 8 port QDR switch for savings (40 vs 56 Gbps)
    * and all server adapter cards to QDR (with one hooked up to the existing Voltaire switch)
  * Expand memory footprint
    * Go to 124 GB memory/node to beef up the CPU HPC side of things
    * 16 cpu cores/node minus 4 cpu/gpu cores/node = 12 cpu cores using 104 GB, which is about 8 GB/cpu core
  * Online testing available (K20, do this)
    * then decide on PGI compiler at purchase time
    * maybe all LAPACK libraries too
  * Make the head node a compute node (in/for the future and beef it up too, 256 GB ram?)
  * Leave the 6x2TB disk space (for backup)
    * 2U, 8 drives, up to 6x4=24 TB, possible?
  * Add an entry-level Infiniband/Lustre solution
    * for parallel file locking

  * Spare parts
    * 8 port switch, HCAs and cables, drives ...
    * or get 5 years total warranty
  
  * Testing notes
    * Amber, LAMMPS, NAMD
    * CUDA v4 & v5
    * install/config dirs
    * use gnu ... with openmpi
    * make deviceQuery (see the sketch below)
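
For reference, the CUDA samples' deviceQuery (built with make deviceQuery) prints a long report; a stripped-down sketch covering just the properties the notes above care about (compute capability for double precision support, memory size, ECC) could look like the following. The file name and build line are illustrative.

<code c>
// mini_query.cu -- deviceQuery-style sketch limited to the properties that
// matter here: compute capability (double precision support), global
// memory, multiprocessor count, ECC.
//
// Build (illustrative): nvcc -o mini_query mini_query.cu

#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        printf("no CUDA devices found\n");
        return 1;
    }
    for (int d = 0; d < n; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        printf("GPU %d: %s\n", d, p.name);
        printf("  compute capability : %d.%d\n", p.major, p.minor);
        printf("  multiprocessors    : %d\n", p.multiProcessorCount);
        printf("  global memory      : %.1f GB\n",
               p.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        printf("  ECC enabled        : %s\n", p.ECCEnabled ? "yes" : "no");
    }
    return 0;
}
</code>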
\\
**[[cluster:0|Back]]**