cluster:107 [2012/12/17 18:44]
hmeij [ConfCall & Quote: HP]
cluster:107 [2013/01/16 15:20]
hmeij [ConfCall & Quote: MW]
Line 299: Line 299:
  
   * buy a single rack and test locally, start small (will future racks be compatible?)
 +
 +==== Yale Qs ====
 +
 +We are tasked with getting GPU HPC going at Wesleyan and are trying to gain insight into similar projects. If you acquired a GPU HPC cluster ...
 +
 +  * What was the most important design element of the cluster?
 +  * What factor(s) settled the CPU to GPU ratio?
 +  * Was single or double precision peak performance more important, or neither?
 +  * What software suite was in mind (commercial, open source, or custom GPU-enabled code)?
 +  * How did you reach out to and educate users on the aspects of GPU computing?
 +  * What was the impact on the users? (recoding, recompiling)
 +  * Was the expected computational speed-up realized?
 +  * Were the PGI Accelerator compilers leveraged? If so, what were the results?
 +  * Do users compile with nvcc?
 +  * Does the scheduler have a resource for idle GPUs so they can be reserved?
 +  * How are the GPUs exposed/assigned to the jobs the scheduler submits? (see the sketch after this list)
 +  * Do you allow multiple serial jobs to access the same GPU? Or one parallel job to use multiple GPUs?
 +  * Can parallel jobs access multiple GPUs across nodes?
 +  * Any experiences with pmemd.cuda.MPI (part of Amber)?
 +  * What MPI flavor is used most with regard to GPU computing?
 +  * Do you leverage the CPU side of the GPU HPC? For example, if there are 16 GPUs and 64 CPU cores on a cluster, do you allow 48 standard jobs on the idle cores (assuming a maximum of 16 serial GPU jobs)?
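
To make the exposure/assignment questions concrete: a common convention (an assumption here, not something stated on this page) is that the scheduler reserves GPUs for a job and publishes them through the standard ''CUDA_VISIBLE_DEVICES'' environment variable, so the CUDA runtime only enumerates those devices. A minimal nvcc-compiled sketch of what a job would then see:

<code c>
/* gpu_visibility.cu -- compile with: nvcc gpu_visibility.cu -o gpu_visibility
   Prints what the CUDA runtime sees after the scheduler (hypothetically)
   restricted this job via CUDA_VISIBLE_DEVICES. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const char *visible = getenv("CUDA_VISIBLE_DEVICES");
    printf("CUDA_VISIBLE_DEVICES = %s\n",
           visible ? visible : "(unset: all GPUs visible)");

    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no usable CUDA devices for this job\n");
        return 1;
    }
    /* The runtime renumbers the visible devices 0..count-1 within the job,
       so a serial GPU job can simply use device 0. */
    printf("this job can use %d GPU(s), starting at device 0\n", count);
    return 0;
}
</code>

Under that convention a serial GPU job always takes device 0; parallel jobs pick from the renumbered set.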
 +
 +Notes from the 04/01/2012 ConfCall
 +
 +  * Applications drive the CPU-to-GPU ratio; most will be 1-to-1, certainly not larger than 1-to-3
 +  * Users did not share GPUs but could obtain more than one, always on same node
 +  * Experimental setup with 36 GB/node, dual 8-core chips
 +  * Nothing larger than that memory-wise, as the CPU and GPU HPC work environments were not mixed
 +  * No raw code development
 +  * Speed-ups were hard to quantify
 +  * PGI Accelerator was used because it is needed with any Fortran code (Note!)
 +  * Double precision was most important in scientific applications
 +  * MPI flavor was OpenMPI; others (including MVAPICH) showed no advantages
 +  * Book: Programming Massively Parallel Processors: A Hands-on Approach, Second Edition
 +    * by David B. Kirk and Wen-mei W. Hwu (Dec 28, 2012)
 +    * Has examples of how to expose GPUs across nodes (see the sketch after this list)
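
On the "expose GPUs across nodes" point: the usual pattern with OpenMPI (or any MPI flavor) is to give each rank one node-local GPU and let MPI span the nodes. The sketch below is an illustration under that assumption, not code from the book or from Yale's setup; the simple modulo mapping assumes ranks are spread evenly over the nodes.

<code c>
/* rank_to_gpu.cu -- sketch: one GPU per MPI rank on each node.
   Build is site-dependent (e.g. nvcc with the MPI wrapper as host
   compiler, or the MPI wrapper compiler plus -lcudart). */
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);

    /* Assumes ranks are packed onto nodes so rank % ngpus spreads them
       over the node-local GPUs; GPU-aware codes such as pmemd.cuda.MPI
       do their own, more careful, mapping. */
    int mygpu = (ngpus > 0) ? rank % ngpus : -1;
    if (mygpu >= 0) cudaSetDevice(mygpu);

    printf("rank %d of %d -> local GPU %d (of %d visible)\n",
           rank, size, mygpu, ngpus);

    MPI_Finalize();
    return 0;
}
</code>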
  
 ==== ConfCall & Quote: AC ====
Line 321: Line 357:
  
 ^  Topic^Description  ^
-|  General| 2 CPUs (16 cores), 3 GPUs (22,500 cuda cores), 32 gb ram/node|
+|  General| 2 CPUs (16 cores), 3 GPUs (7,500 cuda cores), 32 gb ram/node|
 |  Head Node| None|
 |  Nodes|1x4U Rackmountable Chassis, 2xXeon E5-2660 2.20 GHz 20MB Cache 8 cores (16 cores/node), Romley series|
Line 448: Line 484:
   * <del>First unit, single tray in chassis</del>
   * This hardware can be tested at ExxactCorp, so a single-tray purchase for testing is not a requirement
 +
 +  * 2 chassis in 8U + 4 SL250s, each with 8 GPUs, would be a massive GPU cruncher
 +    * 8 CPUs, 32 GPUs = 64 cpu cores and 80,000 cuda cores (avg 1,250 cuda cores per cpu core)
 +    * peak performance: 37.44 TFLOPS double, 112.64 TFLOPS single precision (twice the "benchmark option"; arithmetic sketched after this list)
 +  * 1 chassis in 4U + 2 SL250s, each with * GPUs, would be the "benchmark option"
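
The totals above work out if each card is a Tesla K20 (roughly 2,500 CUDA cores and 1.17/3.52 TFLOPS double/single precision per GPU); those per-card figures are the assumption in this back-of-the-envelope check:

<code c>
/* gpu_totals.c -- back-of-the-envelope totals for the "massive GPU cruncher":
   4 SL250s nodes, each with 2 CPUs (8 cores) and 8 GPUs.
   Per-GPU figures assume NVIDIA Tesla K20 cards. */
#include <stdio.h>

int main(void)
{
    const int nodes = 4, cpus_per_node = 2, cores_per_cpu = 8, gpus_per_node = 8;
    const int cuda_cores_per_gpu = 2496;       /* K20 */
    const double dp_tflops_per_gpu = 1.17;     /* K20 double precision */
    const double sp_tflops_per_gpu = 3.52;     /* K20 single precision */

    int cpu_cores  = nodes * cpus_per_node * cores_per_cpu;  /* 64 */
    int gpus       = nodes * gpus_per_node;                  /* 32 */
    int cuda_cores = gpus * cuda_cores_per_gpu;              /* 79,872 ~ 80,000 */

    printf("%d cpu cores, %d GPUs, %d cuda cores (~%d cuda cores per cpu core)\n",
           cpu_cores, gpus, cuda_cores, cuda_cores / cpu_cores);
    printf("peak: %.2f TFLOPS double, %.2f TFLOPS single\n",
           gpus * dp_tflops_per_gpu, gpus * sp_tflops_per_gpu);
    return 0;
}
</code>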
  
 ^  Topic^Description  ^
 |  General| 6 CPUs (total 48 cores), 18 GPUs (45,000 cuda cores), 64 gb ram/node, no head node|
 |  Head Node|None|
-|  Chassis| 2xs6500 Chassis (4U) can each hold 2 half-width SL270s(gen8, 4U) servers, rackmounted, 4x1200W power supplies, 1x4U rack blank|
+|  Chassis| 2xs6500 Chassis (4U) can each hold 2 half-width SL250s(gen8, 4U) servers, rackmounted, 4x1200W power supplies, 1x4U rack blank|
 |  Nodes| 3xSL250s(gen8), 3x2xXeon E5-2650 2.0 GHz 20MB Cache 8 cores (total 16 cores/node), Romley series|
 |  |3x16x8GB 240-Pin DDR3 1600 MHz (64gb/node, 10+ gb/gpu, max 256gb)|
Line 468: Line 509:
  
  
-  * To compare with “benchmark option” price wise; ??lower (25% less CPU cores)
+  * To compare with “benchmark option” price wise; 37% higher (25% fewer CPU cores)
   * To compare with “benchmark option” performance; 12.5% higher (double precision peak)
  
-  * chassis in 8U + 4 SL250s + each with 8 GPUs would be a massive GPU cruncher
-    * 8 CPUs, 32 GPUs = 64 cpu cores and 80,000 cuda cores (avg 1,250 cuda/core)
-    * peak performance: 37.44 double, 112.64 single precision (twice the "benchmark option")
-  * 1 chassis in 4U + 2 SL250s + each with * GPUs would be the "benchmark option"
+  * When the quote is reduced to 1x s6500 chassis and 2x SL250s:
+    * To compare with “benchmark option” price wise; 1.6% higher (50% fewer CPU cores)
+    * To compare with “benchmark option” performance; 25% lower (double precision peak)
  
   * HP on site install
Line 567: Line 607:
 |  | 5x upgrade to 64 GB per node|
  
-  * 2% more expensive than benchmark option, else identical
+  * At full load 5,900 Watts and 20,131 BTUs/hour (5,900 W x 3.412 BTU/hr per watt)
+
+  * 2% more expensive than "benchmark option" (as described above with Upgrades), else identical
     * But a new rack (advantageous for data center)
     * With lifetime technical support
     * solid state drives on compute nodes
 +    * 12 TB local storage
  
-  * 36 port FDR switch overkill
-    * substitute with 12 port QDR switch
-    * and all servers to QDR
-  * Execute the Upgrade packages
+Then:
  
 +  * Replace the 36 port FDR switch with an 8 port QDR switch for savings (40 vs 56 Gbps)
 +    * and move all server adapter cards to QDR (with one hooked up to the existing Voltaire switch)
 +  * Expand memory footprint
 +    * Go to 124 GB memory/node to beef up the CPU HPC side of things
 +    * 16 cpu cores/node minus 4 cpu cores/node dedicated to GPUs = 12 cpu cores using 104 GB, which is about 8.7 GB/cpu core
 +  * Online testing available (K20, do this)
 +    * then decide on PGI compiler at purchase time
 +    * maybe all LAPACK libraries too
 +  * Make the head node a compute node (for the future; beef it up too, 256 GB ram?)
 +  * Leave the 6x2TB disk space (for backup) 
 +    * 2U, 8 drives up to 6x4=24 TB, possible?
 +  * Add an entry level Infiniband/Lustre solution
 +    * for parallel file locking
  
-  * Online testing available
+  * Spare parts
 +    * 8 port switch, HCAs and cables, drives ... 
 +    * or get 5 years total warranty
  
 +  * Testing notes
 +    * Amber, LAMMPS, NAMD
 +    * cuda v4&5
 +    * install/config dirs
 +    * use gnu ... with openmpi 
 +    * make deviceQuery (a minimal sketch of what it reports is below)
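
For the ''make deviceQuery'' step: the NVIDIA SDK sample does the real work, but the same information can be pulled with a few CUDA runtime calls. A minimal stand-in (an illustration, not the SDK sample) that lists what each node's GPUs report:

<code c>
/* mini_device_query.cu -- build with: nvcc mini_device_query.cu -o mini_device_query
   Lists the GPUs the CUDA runtime can see, with the properties most
   relevant for sizing jobs. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    if (count == 0) {
        fprintf(stderr, "no CUDA devices visible\n");
        return 1;
    }

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("device %d: %s\n", d, prop.name);
        printf("  compute capability : %d.%d\n", prop.major, prop.minor);
        printf("  global memory      : %.1f GB\n",
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        printf("  multiprocessors    : %d\n", prop.multiProcessorCount);
        printf("  ECC enabled        : %s\n", prop.ECCEnabled ? "yes" : "no");
    }
    return 0;
}
</code>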
 \\
 **[[cluster:0|Back]]**
cluster/107.txt · Last modified: 2013/09/11 13:18 by hmeij