cluster:107 [2012/12/12 21:00]
hmeij [ConfCall & Quote: EC]
cluster:107 [2013/01/04 15:40]
hmeij [Yale Qs]
Line 299: Line 299:
  
   * buy a single rack and test locally, start small (will future racks be compatible?)
 +
 +==== Yale Qs ====
 +
 +We are tasked with getting GPU HPC going at Wesleyan and are trying to gain insight from comparable projects. If you have acquired a GPU HPC cluster ...
 +
 +  * What was the most important design element of the cluster?
 +  * What factor(s) settled the CPU to GPU ratio?
 +  * Was single or double precision peak performance more important, or neither?
 +  * What was the software suite in mind (commercial, open source, or custom code GPU "enabled")?
 +  * How did you reach out/educate users on the aspects of GPU computing?
 +  * What was the impact on the users? (recoding, recompiling)
 +  * Was the expected computational speed up realized?
 +  * Were the PGI Accelerator compilers leveraged? If so, what were the results?
 +  * Do users compile with nvcc?
 +  * Does the scheduler have a resource for idle GPUs so they can be reserved?
 +  * How are the GPUs exposed/assigned to jobs the scheduler submits? (see the sketch after this list)
 +  * Do you allow multiple serial jobs to access the same GPU? Or one parallel job to access multiple GPUs?
 +  * Can parallel jobs access multiple GPUs across nodes?
 +  * Any experiences with pmemd.cuda.MPI (part of Amber)?
 +  * What MPI flavor is used most in regards to GPU computing?
 +  * Do you leverage the CPU HPC of the GPU HPC? For example, if there are 16 GPUs and 64 CPU cores on a cluster, do you allow 48 standard jobs on the idle cores? (assuming the max of 16 serial GPU jobs)
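
A minimal sketch of what the nvcc and GPU-assignment questions above are getting at (hypothetical file name, not from any vendor or from Yale). Users compile this kind of code with nvcc; if the scheduler sets CUDA_VISIBLE_DEVICES for a job, "device 0" below is simply whichever physical GPU the scheduler exposed to that job, so the code needs no changes.

<code>
// vecadd.cu -- compile with: nvcc vecadd.cu -o vecadd
// The scheduler (via CUDA_VISIBLE_DEVICES) decides which physical GPU
// "device 0" maps to; the program itself stays the same.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vecadd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;                                   // device buffers
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    vecadd<<<(n + 255) / 256, 256>>>(da, db, dc, n);       // launch on the assigned GPU
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %.1f (expect 3.0)\n", hc[0]);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
</code>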
 +
 +Notes from ConfCall 04/01/2013
 +
 +  * Applications drive the CPU-to-GPU ratio; most will be 1-to-1, certainly not larger than 1-to-3
 +  * Users did not share GPUs but could obtain more than one, always on the same node
 +  * Experimental setup with 36 GB/node, dual 8-core chips
 +  * Nothing larger than that memory-wise, as CPU and GPU HPC work environments were not mixed
 +  * No raw code development
 +  * Speedups were hard to quantify
 +  * PGI Accelerator was used because it is needed with any Fortran code (Note!)
 +  * Double precision was most important in scientific applications
 +  * MPI flavor was OpenMPI; others (including MVAPICH) showed no advantages
 +  * Book:  Programming Massively Parallel Processors, Second Edition: A Hands-on Approach by David B. Kirk and Wen-mei W. Hwu (Dec 28, 2012) 
 +    * Has examples of how to expose GPUs across nodes (see the sketch below for the usual local-rank approach)
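
Not from the book or the conf call, just a minimal sketch of how GPUs are commonly exposed to a multi-node parallel job: each MPI rank picks a device based on its node-local rank. The environment variable shown is the one OpenMPI's mpirun sets; other MPI flavors use different names.

<code>
// bindgpu.cu -- compile with nvcc, launch with e.g. mpirun -np 4 ./bindgpu
// Each rank reads its node-local rank and claims a matching GPU, so ranks
// on the same node do not pile onto one device.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    if (ngpus == 0) { printf("no GPUs visible\n"); return 1; }

    const char *lr = getenv("OMPI_COMM_WORLD_LOCAL_RANK");   // set by OpenMPI's mpirun
    int local_rank = lr ? atoi(lr) : 0;
    int device = local_rank % ngpus;                          // wrap if more ranks than GPUs
    cudaSetDevice(device);

    printf("local rank %d bound to GPU %d of %d\n", local_rank, device, ngpus);
    return 0;
}
</code>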
  
 ==== ConfCall & Quote: AC ====
Line 321: Line 356:
  
 ^  Topic^Description  ^
-|  General| 2 CPUs (16 cores), 3 GPUs ( 22,500 cuda cores), 32 gb ram/node|
+|  General| 2 CPUs (16 cores), 3 GPUs ( 7,500 cuda cores), 32 gb ram/node|
 |  Head Node| None|
 |  Nodes|1x4U Rackmountable Chassis, 2xXeon E5-2660 2.20 Ghz 20MB Cache 8 cores (16cores/node), Romley series|
Line 367: Line 402:
 **EC Quote**
  
-9U rack on wheels\\
-[[http://exxactcorp.com/index.php/solution/solu_detail/119]] Fremont, CA
+ 
+  [[http://exxactcorp.com/index.php/solution/solu_detail/119]] Fremont, CA
  
 ^  Topic^Description  ^
Line 395: Line 430:
  
   * This is the (newest) simcluster design (that can be tested starting Jan 2013)
 +    * 24U cabinet
   * We could deprecate 50% of bss24 queue freeing two L6-30 connectors
   * Spare parts:
Line 447: Line 483:
   * <del>First unit, single tray in chassis</del>
   * This hardware can be tested at ExxactCorp, so a single-tray purchase for testing is not a requirement
 +
 +  * 2 chassis in 8U + 4 SL250s, each with 8 GPUs, would be a massive GPU cruncher
 +    * 8 CPUs, 32 GPUs = 64 cpu cores and 80,000 cuda cores (avg 1,250 cuda cores per cpu core); see the arithmetic sketch below
 +    * peak performance: 37.44 teraflops double, 112.64 teraflops single precision (twice the "benchmark option")
 +  * 1 chassis in 4U + 2 SL250s, each with 8 GPUs, would be the "benchmark option"
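
The aggregate figures above follow from the per-card K20 numbers. A quick back-of-envelope check; the per-card peaks of 1.17 teraflops double, 3.52 teraflops single and 2,496 cuda cores are the usual K20 ratings, assumed here rather than taken from any quote on this page.

<code>
// peaks.cu -- plain host arithmetic, no GPU required
#include <cstdio>

int main() {
    const int gpus = 32, cpu_cores = 64;
    printf("cuda cores : %d (avg %d per cpu core)\n",
           gpus * 2496, gpus * 2496 / cpu_cores);           // ~80,000 total, ~1,250 per core
    printf("peak double: %.2f teraflops\n", gpus * 1.17);   // 37.44
    printf("peak single: %.2f teraflops\n", gpus * 3.52);   // 112.64
    return 0;
}
</code>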
  
 ^  Topic^Description  ^
 |  General| 6 CPUs (total 48 cores), 18 GPUs (45,000 cuda cores), 64 gb ram/node, no head node|
 |  Head Node|None|
-|  Chassis| 2xs6500 Chassis (4U) can each hold 2 half-width SL270s(gen8, 4U) servers, rackmounted, 4x1200W power supplies, 1x4U rack blank|
+|  Chassis| 2xs6500 Chassis (4U) can each hold 2 half-width SL250s(gen8, 4U) servers, rackmounted, 4x1200W power supplies, 1x4U rack blank|
-|  Nodes| 3xSL270s(gen8), 3x2xXeon E5-2650 2.0 Ghz 20MB Cache 8 cores (total 16 cores/node), Romley series|
+|  Nodes| 3xSL250s(gen8), 3x2xXeon E5-2650 2.0 Ghz 20MB Cache 8 cores (total 16 cores/node), Romley series|
 |  |3x16x8GB 240-Pin DDR3 1600 MHz (64gb/node, 10+ gb/gpu, max 256gb)|
 |  |3x2x500GB 7200RPM, 3x6xNVIDIA Tesla K20 5 GB GPUs (6 gpu/node), 1CPU-to-3GPU ratio|
Line 464: Line 505:
 |  Warranty|3 Year Parts and Labor (HP technical support?)|
 |  GPU Teraflops|21.06 double, 63.36 single|
-|  Quote|<html><!-- $00,000 S&--></html>Coming|
+|  Quote|<html><!-- $128,370, for a 1x6500+2xSl250 setup estimate is $95,170 --></html>Arrived (S&H and insurance?)|
  
  
-  * To compare with “benchmark option” price wise; ??lower (25% less CPU cores)
+  * To compare with “benchmark option” price wise; 37% higher (25% less CPU cores)
   * To compare with “benchmark option” performance; 12.5% higher (double precision peak)
  
-  * chassis in 8U + 4 SL270s + each with 8 GPUs would be a massive GPU cruncher
-    * 8 CPUs, 32 GPUs = 64 cpu cores and 80,000 cuda cores (avg 1,250cuda/core)
-    * peak performance: 37.44 double, 112.64 single precision (twice the "benchmark option")
-  * 1 chassis in 4U + 2 Sl270s + each with * GPUs would the "benchmark option"
+  * When quote is reduced to 1x s6500 chassis and 2x SL250s:
+    * To compare with “benchmark option” price wise; 1.6% higher (50% less CPU cores)
+    * To compare with “benchmark option” performance; 25% lower (double precision peak)
+
  
   * HP on site install
Line 538: Line 578:
 [[http://www.microway.com/tesla/clusters.html]] Plymouth, MA :!:
  
-Table here ... 
  
-  Technical support for the life of the hardware
-  * Online testing available
-  *
+^  Topic^Description  ^
+|  General| 8 CPUs (64 cores), 16 GPUs (40,000 cuda cores), 32 gb ram/node, plus head node|
+|  Head Node|1x2U Rackmount System, 2xXeon E5-2650 2.0 Ghz 20MB Cache 8 cores|
 +|  |8x4GB 240-Pin DDR3 1600 MHz ECC (max 512gb), 2x10/100/1000 NIC, 3x PCIe x16 Full, 3x PCIe x8| 
 +|  |2x1TB 7200RPM (Raid 1) + 6x2TB (Raid 6), Areca Raid Controller| 
 +|  |Low profile graphics card, ConnectX-3 VPI adapter card, Single-Port, FDR 56Gb/s| 
 +|  |740w Power Supply 1+1 redundant| 
 +|  Nodes|4x1U Rackmountable Chassis, 4x2 Xeon E5-2650 2.0 Ghz 20MB Cache 8 cores (16/node), Sandy Bridge series| 
 +|  |4x8x4GB 240-Pin DDR3 1600 MHz (32gb/node memory, 8gb/gpu, max 256gb)| 
 +|  |4x1x120GB SSD 7200RPM, 4x4xNVIDIA Tesla K20 5 GB GPUs (4/node), 1CPU-2GPU ratio| 
 +|  |2x10/100/1000 NIC, Dedicated IPMI Port, 4x PCIE 3.0 x16 Slots| 
 +|  |4xConnectX-3 VPI adapter card, Single-Port, FDR 56Gb/s| 
 +|  |4x1800W (non) Redundant Power Supplies| 
 +|  Network|1x Mellanox InfiniBand FDR Switch (36 ports) & HCAs (single port) + 3m cable FDR to existing Voltaire switch|
 +|  |1x 1U 48 Port Rackmount Switch, 10/100/1000, Unmanaged (cables)|
 +|  Rack|1x42U rack with power distribution|
 +|  Power|2xPDU, Basic rack, 30A, 208V, Requires 1x L6-30 Power Outlet Per PDU (NEMA L6-30P)|
 +|  Software| CentOS, Bright Cluster Management (1 year support), MVAPICH, OpenMPI, CUDA 5|
 +|  | scheduler and gnu compilers installed and configured| 
 +|  | Amber12, Lammps, Barracuda (for weirlab?), and others if desired ...bought through MW| 
 +|  Warranty|3 Year Parts and Labor (lifetime technical support)|  
 +|  GPU Teraflops|18.72 double, 56.32 single| 
 +|  Quote|<html><!-- estimated at $95,800 --></html>Arrived, includes S&H and Insurance| 
 +|  Upgrades|Cluster pre-installation service|
 +|  | 5x2 E5-2660 2.20 Ghz 8 core CPUs| 
 +|  | 5x upgrade to 64 GB per node| 
 + 
 +  * At full load 5,900 Watts and 20,131 BTUs/hour  
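
A quick cross-check of that figure, assuming the standard conversion of roughly 3.412 BTU/hour per watt:

<code>
// btu.cu -- plain host arithmetic
#include <cstdio>

int main() {
    const double watts = 5900.0;
    printf("%.0f W ~= %.0f BTU/hour\n", watts, watts * 3.412);  // ~20,131 BTU/hour
    return 0;
}
</code>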
 + 
 +  * 2% more expensive than "benchmark option" (as described above with Upgrades), else identical
 +    * But a new rack (advantageous for the data center)
 +    * With lifetime technical support
 +    * solid state drives on compute nodes 
 +    * 12 TB local storage 
 + 
 +Then 
 + 
 +  * Replace the 36 port FDR switch with an 8 port QDR switch for savings (40 vs 56 Gbps)
 +    * and all server adapter cards to QDR (with one hooked up to the existing Voltaire switch)
 +  * Expand memory footprint
 +    * Go to 124 GB memory/node to beef up the CPU HPC side of things
 +    * 16 cpu cores/node minus the 4 cpu cores serving gpu jobs = 12 cpu cores using 104 GB, which is about 8 GB/cpu core (see the sketch after this list)
 +  * Online testing available (K20, do this) 
 +    * then decide on PGI compiler at purchase time 
 +    * maybe all Lapack libraries too 
 +  * Make the head node a compute node (in/for the future and beef it up too, 256 GB ram?) 
 +  * Leave the 6x2TB disk space (for backup)  
 +    * 2U, 8 drives up to 6x4=24 TB, possible? 
 +  * Add an entry level Infiniband/Lustre solution 
 +    * for parallel file locking 
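
A sketch of the memory arithmetic above. The split assumed here (each of the 4 GPU jobs on a node gets one cpu core plus 5 GB of host RAM, mirroring the K20's 5 GB) is not stated in the notes; they only give the end result of roughly 8 GB per remaining cpu core.

<code>
// membudget.cu -- plain host arithmetic
#include <cstdio>

int main() {
    const int node_gb = 124, cores = 16, gpus = 4, gb_per_gpu_job = 5;  // assumed split
    int cpu_cores = cores - gpus;                     // 12 cores left for plain CPU jobs
    int cpu_gb    = node_gb - gpus * gb_per_gpu_job;  // 124 - 20 = 104 GB left
    printf("%d cpu cores share %d GB => about %.1f GB per core\n",
           cpu_cores, cpu_gb, (double)cpu_gb / cpu_cores);              // ~8.7 GB/core
    return 0;
}
</code>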
 + 
 +  * Spare parts 
 +    * 8 port switch, HCAs and cables, drives ... 
 +    * or get 5 years total warranty
  
 \\
 **[[cluster:0|Back]]**