cluster:107 [2012/12/17 18:44]
hmeij [ConfCall & Quote: HP]
cluster:107 [2013/01/16 15:20]
hmeij [ConfCall & Quote: MW]
Line 299: Line 299:
  
   * buy a single rack and test locally, start small (will future racks be compatible?)
 +
 +==== Yale Qs ====
 +
 +We are tasked with getting GPU HPC going at Wesleyan and are trying to gain insight into similar projects. If you acquired a GPU HPC cluster ...
 +
 +  * What was the most important design element of the cluster?
 +  * What factor(s) settled the CPU to GPU ratio?
 +  * Was single or double precision peak performance more important, or neither?
 +  * What software suite was in mind (commercial, open source, or custom GPU-enabled code)?
 +  * How did you reach out to and educate users on the aspects of GPU computing?
 +  * What was the impact on the users? (recoding, recompiling)
 +  * Was the expected computational speed-up realized?
 +  * Were the PGI Accelerator compilers leveraged? If so, what were the results?
 +  * Do users compile with nvcc?
 +  * Does the scheduler have a resource for idle GPUs so they can be reserved?
 +  * How are the GPUs exposed/assigned to the jobs the scheduler submits? (see the sketch after this list)
 +  * Do you allow multiple serial jobs to access the same GPU? Or one parallel job to use multiple GPUs?
 +  * Can parallel jobs access multiple GPUs across nodes?
 +  * Any experiences with pmemd.cuda.MPI (part of Amber)?
 +  * What MPI flavor is used most with regard to GPU computing?
 +  * Do you leverage the CPU side of the GPU HPC? For example, if there are 16 GPUs and 64 CPU cores on a cluster, do you allow 48 standard jobs on the idle cores (assuming a maximum of 16 serial GPU jobs)?
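
To make the exposure/assignment questions concrete: a common convention (an assumption here, not something stated on this page) is that the scheduler reserves GPUs for a job and publishes them through the standard ''CUDA_VISIBLE_DEVICES'' environment variable, so the CUDA runtime only enumerates those devices. A minimal nvcc-compiled sketch of what a job would then see:

<code c>
/* gpu_visibility.cu -- compile with: nvcc gpu_visibility.cu -o gpu_visibility
   Prints what the CUDA runtime sees after the scheduler (hypothetically)
   restricted this job via CUDA_VISIBLE_DEVICES. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const char *visible = getenv("CUDA_VISIBLE_DEVICES");
    printf("CUDA_VISIBLE_DEVICES = %s\n",
           visible ? visible : "(unset: all GPUs visible)");

    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no usable CUDA devices for this job\n");
        return 1;
    }
    /* The runtime renumbers the visible devices 0..count-1 within the job,
       so a serial GPU job can simply use device 0. */
    printf("this job can use %d GPU(s), starting at device 0\n", count);
    return 0;
}
</code>

Under that convention a serial GPU job always takes device 0; parallel jobs pick from the renumbered set.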
 +
 +Notes from the 04/01/2012 ConfCall
 +
 +  * Applications drive the CPU-to-GPU ratio; most will be 1-to-1, certainly not larger than 1-to-3
 +  * Users did not share GPUs but could obtain more than one, always on same node
 +  * Experimental setup with 36 GB/node, dual 8-core chips
 +  * Nothing larger than that memory-wise, as the CPU and GPU HPC work environments were not mixed
 +  * No raw code development
 +  * Speed-ups were hard to quantify
 +  * PGI Accelerator was used because it is needed with any Fortran code (Note!)
 +  * Double precision was most important in scientific applications
 +  * MPI flavor was OpenMPI; others (including MVAPICH) showed no advantages
 +  * Book: Programming Massively Parallel Processors: A Hands-on Approach, Second Edition
 +    * by David B. Kirk and Wen-mei W. Hwu (Dec 28, 2012)
 +    * Has examples of how to expose GPUs across nodes (see the sketch after this list)
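
On the "expose GPUs across nodes" point: the usual pattern with OpenMPI (or any MPI flavor) is to give each rank one node-local GPU and let MPI span the nodes. The sketch below is an illustration under that assumption, not code from the book or from Yale's setup; the simple modulo mapping assumes ranks are spread evenly over the nodes.

<code c>
/* rank_to_gpu.cu -- sketch: one GPU per MPI rank on each node.
   Build is site-dependent (e.g. nvcc with the MPI wrapper as host
   compiler, or the MPI wrapper compiler plus -lcudart). */
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);

    /* Assumes ranks are packed onto nodes so rank % ngpus spreads them
       over the node-local GPUs; GPU-aware codes such as pmemd.cuda.MPI
       do their own, more careful, mapping. */
    int mygpu = (ngpus > 0) ? rank % ngpus : -1;
    if (mygpu >= 0) cudaSetDevice(mygpu);

    printf("rank %d of %d -> local GPU %d (of %d visible)\n",
           rank, size, mygpu, ngpus);

    MPI_Finalize();
    return 0;
}
</code>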
  
 ==== ConfCall & Quote: AC ====
Line 321: Line 357:
  
 ^  Topic^Description  ^
-|  General| 2 CPUs (16 cores), 3 GPUs (22,500 cuda cores), 32 gb ram/node|
+|  General| 2 CPUs (16 cores), 3 GPUs (7,500 cuda cores), 32 gb ram/node|
 |  Head Node| None|
 |  Nodes|1x4U Rackmountable Chassis, 2xXeon E5-2660 2.20 GHz 20MB Cache 8 cores (16 cores/node), Romley series|
Line 448: Line 484:
   * <del>First unit, single tray in chassis</del>
   * This hardware can be tested at ExxactCorp, so a single-tray purchase for testing is not a requirement
 +
 +  * 2 chassis in 8U + 4 SL250s, each with 8 GPUs, would be a massive GPU cruncher
 +    * 8 CPUs, 32 GPUs = 64 cpu cores and 80,000 cuda cores (avg 1,250 cuda cores per cpu core)
 +    * peak performance: 37.44 TFLOPS double, 112.64 TFLOPS single precision (twice the "benchmark option"; arithmetic sketched after this list)
 +  * 1 chassis in 4U + 2 SL250s, each with * GPUs, would be the "benchmark option"
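
The totals above work out if each card is a Tesla K20 (roughly 2,500 CUDA cores and 1.17/3.52 TFLOPS double/single precision per GPU); those per-card figures are the assumption in this back-of-the-envelope check:

<code c>
/* gpu_totals.c -- back-of-the-envelope totals for the "massive GPU cruncher":
   4 SL250s nodes, each with 2 CPUs (8 cores) and 8 GPUs.
   Per-GPU figures assume NVIDIA Tesla K20 cards. */
#include <stdio.h>

int main(void)
{
    const int nodes = 4, cpus_per_node = 2, cores_per_cpu = 8, gpus_per_node = 8;
    const int cuda_cores_per_gpu = 2496;       /* K20 */
    const double dp_tflops_per_gpu = 1.17;     /* K20 double precision */
    const double sp_tflops_per_gpu = 3.52;     /* K20 single precision */

    int cpu_cores  = nodes * cpus_per_node * cores_per_cpu;  /* 64 */
    int gpus       = nodes * gpus_per_node;                  /* 32 */
    int cuda_cores = gpus * cuda_cores_per_gpu;              /* 79,872 ~ 80,000 */

    printf("%d cpu cores, %d GPUs, %d cuda cores (~%d cuda cores per cpu core)\n",
           cpu_cores, gpus, cuda_cores, cuda_cores / cpu_cores);
    printf("peak: %.2f TFLOPS double, %.2f TFLOPS single\n",
           gpus * dp_tflops_per_gpu, gpus * sp_tflops_per_gpu);
    return 0;
}
</code>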
  
 ^  Topic^Description  ^
 |  General| 6 CPUs (total 48 cores), 18 GPUs (45,000 cuda cores), 64 gb ram/node, no head node|
 |  Head Node|None|
-|  Chassis| 2xs6500 Chassis (4U) can each hold 2 half-width SL270s(gen8, 4U) servers, rackmounted, 4x1200W power supplies, 1x4U rack blank|
+|  Chassis| 2xs6500 Chassis (4U) can each hold 2 half-width SL250s(gen8, 4U) servers, rackmounted, 4x1200W power supplies, 1x4U rack blank|
 |  Nodes| 3xSL250s(gen8), 3x2xXeon E5-2650 2.0 GHz 20MB Cache 8 cores (total 16 cores/node), Romley series|
 |  |3x16x8GB 240-Pin DDR3 1600 MHz (64gb/node, 10+ gb/gpu, max 256gb)|
Line 468: Line 509:
  
  
-  * To compare with “benchmark option” price wise; ??lower (25% less CPU cores)
+  * To compare with “benchmark option” price wise; 37% higher (25% fewer CPU cores)
   * To compare with “benchmark option” performance; 12.5% higher (double precision peak)
  
-  * chassis in 8U + 4 SL250s + each with 8 GPUs would be a massive GPU cruncher
-    * 8 CPUs, 32 GPUs = 64 cpu cores and 80,000 cuda cores (avg 1,250 cuda/core)
-    * peak performance: 37.44 double, 112.64 single precision (twice the "benchmark option")
-  * 1 chassis in 4U + 2 SL250s + each with * GPUs would be the "benchmark option"
+  * When the quote is reduced to 1x s6500 chassis and 2x SL250s:
+    * To compare with “benchmark option” price wise; 1.6% higher (50% fewer CPU cores)
+    * To compare with “benchmark option” performance; 25% lower (double precision peak)
  
   * HP on site install
Line 567: Line 607:
 |  | 5x upgrade to 64 GB per node|
  
-  * 2% more expensive than benchmark option, else identical
+  * At full load 5,900 Watts and 20,131 BTUs/hour (5,900 W x 3.412 BTU/hr per watt)
+
+  * 2% more expensive than "benchmark option" (as described above with Upgrades), else identical
     * But a new rack (advantageous for data center)
     * With lifetime technical support
     * solid state drives on compute nodes
 +    * 12 TB local storage
  
-  * 36 port FDR switch overkill
-    * substitute with 12 port QDR switch
-    * and all servers to QDR
-  * Execute the Upgrade packages
+Then:
  
 +  * Replace the 36 port FDR switch with an 8 port QDR switch for savings (40 vs 56 Gbps)
 +    * and move all server adapter cards to QDR (with one hooked up to the existing Voltaire switch)
 +  * Expand memory footprint
 +    * Go to 124 GB memory/node to beef up the CPU HPC side of things
 +    * 16 cpu cores/node minus 4 cpu cores/node dedicated to GPUs = 12 cpu cores using 104 GB, which is about 8.7 GB/cpu core
 +  * Online testing available (K20, do this)
 +    * then decide on PGI compiler at purchase time
 +    * maybe all LAPACK libraries too
 +  * Make the head node a compute node (for the future; beef it up too, 256 GB ram?)
 +  * Leave the 6x2TB disk space (for backup) 
 +    * 2U, 8 drives up to 6x4=24 TB, possible?
 +  * Add an entry level Infiniband/Lustre solution
 +    * for parallel file locking
  
-  * Online testing available
+  * Spare parts
 +    * 8 port switch, HCAs and cables, drives ... 
 +    * or get 5 years total warranty
  
 +  * Testing notes
 +    * Amber, LAMMPS, NAMD
 +    * cuda v4&5
 +    * install/config dirs
 +    * use gnu ... with openmpi 
 +    * make deviceQuery (a minimal sketch of what it reports is below)
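
For the ''make deviceQuery'' step: the NVIDIA SDK sample does the real work, but the same information can be pulled with a few CUDA runtime calls. A minimal stand-in (an illustration, not the SDK sample) that lists what each node's GPUs report:

<code c>
/* mini_device_query.cu -- build with: nvcc mini_device_query.cu -o mini_device_query
   Lists the GPUs the CUDA runtime can see, with the properties most
   relevant for sizing jobs. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    if (count == 0) {
        fprintf(stderr, "no CUDA devices visible\n");
        return 1;
    }

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("device %d: %s\n", d, prop.name);
        printf("  compute capability : %d.%d\n", prop.major, prop.minor);
        printf("  global memory      : %.1f GB\n",
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        printf("  multiprocessors    : %d\n", prop.multiProcessorCount);
        printf("  ECC enabled        : %s\n", prop.ECCEnabled ? "yes" : "no");
    }
    return 0;
}
</code>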
 \\
 **[[cluster:0|Back]]**
cluster/107.txt · Last modified: 2013/09/11 13:18 by hmeij