cluster:107 [2012/12/12 21:00]
hmeij [ConfCall & Quote: EC]
cluster:107 [2013/01/04 15:40]
hmeij [Yale Qs]
Line 299: Line 299:
  
   * buy a single rack and test locally, start small (will future racks be compatible?)
 +
 +==== Yale Qs ====
 +
 +We are tasked with getting GPU HPC going at Wesleyan and are trying to gain insight from comparable projects. If you have acquired a GPU HPC cluster ...
 +
 +  * What was the most important design element of the cluster?
 +  * What factor(s) settled the CPU to GPU ratio?
 +  * Was single or double precision peak performance more important, or neither?
 +  * What was the software suite in mind (commercial, open source, or custom code GPU "enabled")?
 +  * How did you reach out/educate users on the aspects of GPU computing?
 +  * What was the impact on the users? (recoding, recompiling)
 +  * Was the expected computational speed up realized?
 +  * Were the PGI Accelerator compilers leveraged? If so, what were the results?
 +  * Do users compile with nvcc?
 +  * Does the scheduler have a resource for idle GPUs so they can be reserved?
 +  * How are the GPUs exposed/assigned to jobs the scheduler submits? (see the sketch after this list)
 +  * Do you allow multiple serial jobs to access the same GPU? Or one parallel job to access multiple GPUs?
 +  * Can parallel jobs access multiple GPUs across nodes?
 +  * Any experiences with pmemd.cuda.MPI (part of Amber)?
 +  * What MPI flavor is used most in regards to GPU computing?
 +  * Do you leverage the CPU HPC of the GPU HPC? For example, if there are 16 GPUs and 64 CPU cores on a cluster, do you allow 48 standard jobs on the idle cores? (assuming the max of 16 serial GPU jobs)
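
A minimal sketch of what the nvcc and GPU-assignment questions above are getting at (hypothetical file name, not from any vendor or from Yale). Users compile this kind of code with nvcc; if the scheduler sets CUDA_VISIBLE_DEVICES for a job, "device 0" below is simply whichever physical GPU the scheduler exposed to that job, so the code needs no changes.

<code>
// vecadd.cu -- compile with: nvcc vecadd.cu -o vecadd
// The scheduler (via CUDA_VISIBLE_DEVICES) decides which physical GPU
// "device 0" maps to; the program itself stays the same.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vecadd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;                                   // device buffers
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    vecadd<<<(n + 255) / 256, 256>>>(da, db, dc, n);       // launch on the assigned GPU
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %.1f (expect 3.0)\n", hc[0]);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
</code>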
 +
 +Notes from ConfCall 04/01/2013
 +
 +  * Applications drive the CPU-to-GPU ratio; most will be 1-to-1, certainly not larger than 1-to-3
 +  * Users did not share GPUs but could obtain more than one, always on the same node
 +  * Experimental setup with 36 GB/node, dual 8-core chips
 +  * Nothing larger than that memory-wise, as CPU and GPU HPC work environments were not mixed
 +  * No raw code development
 +  * Speedups were hard to quantify
 +  * PGI Accelerator was used because it is needed with any Fortran code (Note!)
 +  * Double precision was most important in scientific applications
 +  * MPI flavor was OpenMPI; others (including MVAPICH) showed no advantages
 +  * Book:  Programming Massively Parallel Processors, Second Edition: A Hands-on Approach by David B. Kirk and Wen-mei W. Hwu (Dec 28, 2012) 
 +    * Has examples of how to expose GPUs across nodes (see the sketch below for the usual local-rank approach)
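
Not from the book or the conf call, just a minimal sketch of how GPUs are commonly exposed to a multi-node parallel job: each MPI rank picks a device based on its node-local rank. The environment variable shown is the one OpenMPI's mpirun sets; other MPI flavors use different names.

<code>
// bindgpu.cu -- compile with nvcc, launch with e.g. mpirun -np 4 ./bindgpu
// Each rank reads its node-local rank and claims a matching GPU, so ranks
// on the same node do not pile onto one device.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    if (ngpus == 0) { printf("no GPUs visible\n"); return 1; }

    const char *lr = getenv("OMPI_COMM_WORLD_LOCAL_RANK");   // set by OpenMPI's mpirun
    int local_rank = lr ? atoi(lr) : 0;
    int device = local_rank % ngpus;                          // wrap if more ranks than GPUs
    cudaSetDevice(device);

    printf("local rank %d bound to GPU %d of %d\n", local_rank, device, ngpus);
    return 0;
}
</code>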
  
 ==== ConfCall & Quote: AC ====
Line 321: Line 356:
  
 ^  Topic^Description  ^
-|  General| 2 CPUs (16 cores), 3 GPUs ( 22,500 cuda cores), 32 gb ram/node|
+|  General| 2 CPUs (16 cores), 3 GPUs ( 7,500 cuda cores), 32 gb ram/node|
 |  Head Node| None|
 |  Nodes|1x4U Rackmountable Chassis, 2xXeon E5-2660 2.20 Ghz 20MB Cache 8 cores (16cores/node), Romley series|
Line 367: Line 402:
 **EC Quote**
  
-9U rack on wheels\\
-[[http://exxactcorp.com/index.php/solution/solu_detail/119]] Fremont, CA
+ 
+  [[http://exxactcorp.com/index.php/solution/solu_detail/119]] Fremont, CA
  
 ^  Topic^Description  ^
Line 395: Line 430:
  
   * This is the (newest) simcluster design (that can be tested starting Jan 2013)
 +    * 24U cabinet
   * We could deprecate 50% of bss24 queue freeing two L6-30 connectors
   * Spare parts:
Line 447: Line 483:
   * <del>First unit, single tray in chassis</del>
   * This hardware can be tested at ExxactCorp, so a single-tray purchase for testing is not a requirement
 +
 +  * 2 chassis in 8U + 4 SL250s, each with 8 GPUs, would be a massive GPU cruncher
 +    * 8 CPUs, 32 GPUs = 64 cpu cores and 80,000 cuda cores (avg 1,250 cuda cores per cpu core); see the arithmetic sketch below
 +    * peak performance: 37.44 teraflops double, 112.64 teraflops single precision (twice the "benchmark option")
 +  * 1 chassis in 4U + 2 SL250s, each with 8 GPUs, would be the "benchmark option"
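
The aggregate figures above follow from the per-card K20 numbers. A quick back-of-envelope check; the per-card peaks of 1.17 teraflops double, 3.52 teraflops single and 2,496 cuda cores are the usual K20 ratings, assumed here rather than taken from any quote on this page.

<code>
// peaks.cu -- plain host arithmetic, no GPU required
#include <cstdio>

int main() {
    const int gpus = 32, cpu_cores = 64;
    printf("cuda cores : %d (avg %d per cpu core)\n",
           gpus * 2496, gpus * 2496 / cpu_cores);           // ~80,000 total, ~1,250 per core
    printf("peak double: %.2f teraflops\n", gpus * 1.17);   // 37.44
    printf("peak single: %.2f teraflops\n", gpus * 3.52);   // 112.64
    return 0;
}
</code>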
  
 ^  Topic^Description  ^
 |  General| 6 CPUs (total 48 cores), 18 GPUs (45,000 cuda cores), 64 gb ram/node, no head node|
 |  Head Node|None|
-|  Chassis| 2xs6500 Chassis (4U) can each hold 2 half-width SL270s(gen8, 4U) servers, rackmounted, 4x1200W power supplies, 1x4U rack blank|
+|  Chassis| 2xs6500 Chassis (4U) can each hold 2 half-width SL250s(gen8, 4U) servers, rackmounted, 4x1200W power supplies, 1x4U rack blank|
-|  Nodes| 3xSL270s(gen8), 3x2xXeon E5-2650 2.0 Ghz 20MB Cache 8 cores (total 16 cores/node), Romley series|
+|  Nodes| 3xSL250s(gen8), 3x2xXeon E5-2650 2.0 Ghz 20MB Cache 8 cores (total 16 cores/node), Romley series|
 |  |3x16x8GB 240-Pin DDR3 1600 MHz (64gb/node, 10+ gb/gpu, max 256gb)|
 |  |3x2x500GB 7200RPM, 3x6xNVIDIA Tesla K20 5 GB GPUs (6 gpu/node), 1CPU-to-3GPU ratio|
Line 464: Line 505:
 |  Warranty|3 Year Parts and Labor (HP technical support?)|
 |  GPU Teraflops|21.06 double, 63.36 single|
-|  Quote|<html><!-- $00,000 S&--></html>Coming|
+|  Quote|<html><!-- $128,370, for a 1x6500+2xSl250 setup estimate is $95,170 --></html>Arrived (S&H and insurance?)|
  
  
-  * To compare with “benchmark option” price wise; ??lower (25% less CPU cores)
+  * To compare with “benchmark option” price wise; 37% higher (25% less CPU cores)
   * To compare with “benchmark option” performance; 12.5% higher (double precision peak)
  
-  * chassis in 8U + 4 SL270s + each with 8 GPUs would be a massive GPU cruncher
-    * 8 CPUs, 32 GPUs = 64 cpu cores and 80,000 cuda cores (avg 1,250cuda/core)
-    * peak performance: 37.44 double, 112.64 single precision (twice the "benchmark option")
-  * 1 chassis in 4U + 2 Sl270s + each with * GPUs would the "benchmark option"
+  * When quote is reduced to 1x s6500 chassis and 2x SL250s:
+    * To compare with “benchmark option” price wise; 1.6% higher (50% less CPU cores)
+    * To compare with “benchmark option” performance; 25% lower (double precision peak)
+
  
   * HP on site install
Line 538: Line 578:
 [[http://www.microway.com/tesla/clusters.html]] Plymouth, MA :!:
  
-Table here ... 
  
-  Technical support for the life of the hardware
-  * Online testing available
-  *
+^  Topic^Description  ^
+|  General| 8 CPUs (64 cores), 16 GPUs (40,000 cuda cores), 32 gb ram/node, plus head node|
+|  Head Node|1x2U Rackmount System, 2xXeon E5-2650 2.0 Ghz 20MB Cache 8 cores|
 +|  |8x4GB 240-Pin DDR3 1600 MHz ECC (max 512gb), 2x10/100/1000 NIC, 3x PCIe x16 Full, 3x PCIe x8| 
 +|  |2x1TB 7200RPM (Raid 1) + 6x2TB (Raid 6), Areca Raid Controller| 
 +|  |Low profile graphics card, ConnectX-3 VPI adapter card, Single-Port, FDR 56Gb/s| 
 +|  |740w Power Supply 1+1 redundant| 
 +|  Nodes|4x1U Rackmountable Chassis, 4x2 Xeon E5-2650 2.0 Ghz 20MB Cache 8 cores (16/node), Sandy Bridge series| 
 +|  |4x8x4GB 240-Pin DDR3 1600 MHz (32gb/node memory, 8gb/gpu, max 256gb)| 
 +|  |4x1x120GB SSD 7200RPM, 4x4xNVIDIA Tesla K20 5 GB GPUs (4/node), 1CPU-2GPU ratio| 
 +|  |2x10/100/1000 NIC, Dedicated IPMI Port, 4x PCIE 3.0 x16 Slots| 
 +|  |4xConnectX-3 VPI adapter card, Single-Port, FDR 56Gb/s| 
 +|  |4x1800W (non) Redundant Power Supplies| 
 +|  Network|1x Mellanox InfiniBand FDR Switch (36 ports) & HCAs (single port) + 3m cable FDR to existing Voltaire switch|
 +|  |1x 1U 48 Port Rackmount Switch, 10/100/1000, Unmanaged (cables)|
 +|  Rack|1x42U rack with power distribution|
 +|  Power|2xPDU, Basic rack, 30A, 208V, Requires 1x L6-30 Power Outlet Per PDU (NEMA L6-30P)|
 +|  Software| CentOS, Bright Cluster Management (1 year support), MVAPICH, OpenMPI, CUDA 5|
 +|  | scheduler and gnu compilers installed and configured| 
 +|  | Amber12, Lammps, Barracuda (for weirlab?), and others if desired ...bought through MW| 
 +|  Warranty|3 Year Parts and Labor (lifetime technical support)|  
 +|  GPU Teraflops|18.72 double, 56.32 single| 
 +|  Quote|<html><!-- estimated at $95,800 --></html>Arrived, includes S&H and Insurance| 
 +|  Upgrades|Cluster pre-installation service|
 +|  | 5x2 E5-2660 2.20 Ghz 8 core CPUs| 
 +|  | 5x upgrade to 64 GB per node| 
 + 
 +  * At full load 5,900 Watts and 20,131 BTUs/hour  
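
A quick cross-check of that figure, assuming the standard conversion of roughly 3.412 BTU/hour per watt:

<code>
// btu.cu -- plain host arithmetic
#include <cstdio>

int main() {
    const double watts = 5900.0;
    printf("%.0f W ~= %.0f BTU/hour\n", watts, watts * 3.412);  // ~20,131 BTU/hour
    return 0;
}
</code>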
 + 
 +  * 2% more expensive than "benchmark option" (as described above with Upgrades), else identical
 +    * But a new rack (advantageous for the data center)
 +    * With lifetime technical support
 +    * solid state drives on compute nodes 
 +    * 12 TB local storage 
 + 
 +Then 
 + 
 +  * Replace the 36 port FDR switch with an 8 port QDR switch for savings (40 vs 56 Gbps)
 +    * and all server adapter cards to QDR (with one hooked up to the existing Voltaire switch)
 +  * Expand memory footprint
 +    * Go to 124 GB memory/node to beef up the CPU HPC side of things
 +    * 16 cpu cores/node minus the 4 cpu cores serving gpu jobs = 12 cpu cores using 104 GB, which is about 8 GB/cpu core (see the sketch after this list)
 +  * Online testing available (K20, do this) 
 +    * then decide on PGI compiler at purchase time 
 +    * maybe all Lapack libraries too 
 +  * Make the head node a compute node (in/for the future and beef it up too, 256 GB ram?) 
 +  * Leave the 6x2TB disk space (for backup)  
 +    * 2U, 8 drives up to 6x4=24 TB, possible? 
 +  * Add an entry level Infiniband/Lustre solution 
 +    * for parallel file locking 
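
A sketch of the memory arithmetic above. The split assumed here (each of the 4 GPU jobs on a node gets one cpu core plus 5 GB of host RAM, mirroring the K20's 5 GB) is not stated in the notes; they only give the end result of roughly 8 GB per remaining cpu core.

<code>
// membudget.cu -- plain host arithmetic
#include <cstdio>

int main() {
    const int node_gb = 124, cores = 16, gpus = 4, gb_per_gpu_job = 5;  // assumed split
    int cpu_cores = cores - gpus;                     // 12 cores left for plain CPU jobs
    int cpu_gb    = node_gb - gpus * gb_per_gpu_job;  // 124 - 20 = 104 GB left
    printf("%d cpu cores share %d GB => about %.1f GB per core\n",
           cpu_cores, cpu_gb, (double)cpu_gb / cpu_cores);              // ~8.7 GB/core
    return 0;
}
</code>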
 + 
 +  * Spare parts 
 +    * 8 port switch, HCAs and cables, drives ... 
 +    * or get 5 years total warranty
  
 \\
 **[[cluster:0|Back]]**