  
  * buy a single rack and test locally, start small (will future racks be compatible?)

==== Yale Qs ====

We are tasked with getting GPU HPC going at Wesleyan and are trying to gain insight into the project. If you acquired a GPU HPC ...

  * What was the most important design element of the cluster?
  * What factor(s) settled the CPU to GPU ratio?
  * Was single or double precision peak performance more important, or neither?
  * What was the software suite in mind (commercial, open source, or custom code GPU "enabled")?
  * How did you reach out to/educate users on the aspects of GPU computing?
  * What was the impact on the users (recoding, recompiling)?
  * Was the expected computational speedup realized?
  * Were the PGI Accelerator compilers leveraged? If so, what were the results?
  * Do users compile with nvcc?
  * Does the scheduler have a resource for idle GPUs so they can be reserved?
  * How are the GPUs exposed/assigned to the jobs the scheduler submits? (see the sketch after this list)
  * Do you allow multiple serial jobs to access the same GPU? Or one parallel job to use multiple GPUs?
  * Can parallel jobs access multiple GPUs across nodes?
  * Any experiences with pmemd.cuda.MPI (part of Amber)?
  * Which MPI flavor is used most in regards to GPU computing?
  * Do you leverage the CPU side of the GPU HPC? For example, if there are 16 GPUs and 64 CPU cores on a cluster, do you allow 48 standard jobs on the idle cores (assuming the maximum of 16 serial GPU jobs)?
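
To make the exposure/assignment question concrete: a minimal sketch, assuming the scheduler sets CUDA_VISIBLE_DEVICES per job (the usual way scheduler GPU integrations hand out devices), so a job only sees and enumerates the GPUs it was given. The file name and build line are illustrative, not part of any quoted configuration.

<code c>
// gpu_probe.cu -- sketch: what a batch job sees when the scheduler has set
// CUDA_VISIBLE_DEVICES for it (assumption: the scheduler, not the user,
// exports that variable per job).
//
// Build (illustrative): nvcc -o gpu_probe gpu_probe.cu

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(void) {
    const char *visible = getenv("CUDA_VISIBLE_DEVICES");
    printf("CUDA_VISIBLE_DEVICES = %s\n", visible ? visible : "(not set)");

    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        fprintf(stderr, "no usable GPU visible to this job\n");
        return 1;
    }
    printf("GPUs visible to this job: %d\n", count);

    // Device 0 below is the first device the scheduler exposed to this job,
    // not necessarily physical GPU 0 on the node.
    cudaSetDevice(0);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("using: %s\n", prop.name);
    return 0;
}
</code>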

Notes 04/01/2012 ConfCall

  * Applications drive the CPU-to-GPU ratio and most will be 1-to-1, certainly not larger than 1-to-3
  * Users did not share GPUs but could obtain more than one, always on the same node
  * Experimental setup with 36 GB/node, dual 8-core chips
  * Nothing larger than that memory-wise, as the CPU and GPU HPC work environments were not mixed
  * No raw code development
  * Speedups were hard to quantify
  * PGI Accelerator was used because it is needed with any Fortran code (Note!)
  * Double precision was most important in the scientific applications
  * MPI flavor was OpenMPI; others (including MVAPICH) showed no advantages
  * Book: Programming Massively Parallel Processors: A Hands-on Approach, Second Edition, by David B. Kirk and Wen-mei W. Hwu (Dec 28, 2012)
    * Has examples of how to expose GPUs across nodes (a sketch follows this list)
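
Along those lines, a minimal sketch of one parallel job spreading its MPI ranks over several GPUs on and across nodes: it assumes one rank per GPU and that a simple modulo rule is good enough to divide the local devices among the ranks on a node (reasonable with OpenMPI's default mapping, but an assumption, not a description of Yale's setup). The build and run lines are one common way to do it, not a prescription.

<code c>
// rank2gpu.cu -- sketch: one MPI rank per GPU, possibly spanning nodes.
// Assumes ranks on the same node can share out the local GPUs with a
// modulo rule; with OpenMPI one could instead read the
// OMPI_COMM_WORLD_LOCAL_RANK environment variable for the local rank.
//
// Build (one common way): nvcc -ccbin mpicxx rank2gpu.cu -o rank2gpu
// Run   (illustrative):   mpirun -np 6 -npernode 3 ./rank2gpu

#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len = 0;
    MPI_Get_processor_name(host, &len);

    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);

    // Each rank picks "its" GPU on its own node; cudaSetDevice only sees
    // devices local to that node (or those in CUDA_VISIBLE_DEVICES).
    int dev = (ngpu > 0) ? rank % ngpu : -1;
    if (dev >= 0) cudaSetDevice(dev);

    printf("rank %d/%d on %s -> GPU %d of %d\n", rank, size, host, dev, ngpu);

    MPI_Finalize();
    return 0;
}
</code>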
  
==== ConfCall & Quote: AC ====
  
^  Topic^Description  ^
|  General| 2 CPUs (16 cores), 3 GPUs (7,500 cuda cores), 32 gb ram/node|
|  Head Node| None|
|  Nodes|1x4U Rackmountable Chassis, 2xXeon E5-2660 2.20 GHz 20MB Cache 8 cores (16 cores/node), Romley series|
|  | 5x upgrade to 64 GB per node|
  
  * At full load 5,900 Watts and 20,131 BTUs/hour
  
  * 2% more expensive than the "benchmark option" (as described above with Upgrades), else identical
    * With lifetime technical support
    * solid state drives on compute nodes
    * 12 TB local storage
  
Then
  * 36 port FDR switch replaced with 8 port QDR switch for savings (40 vs 56 Gbps)
    * and all server adapter cards to QDR (with one hooked up to the existing Voltaire switch)
  * Expand memory footprint
    * Go to 124 GB memory/node to beef up the CPU HPC side of things
    * 16 cpu cores/node minus 4 cpu/gpu cores/node = 12 cpu cores using 104 GB, which is about 8 GB/cpu core
  * Online testing available (K20, do this)
    * then decide on PGI compiler at purchase time
    * maybe all LAPACK libraries too
  * Make the head node a compute node (in/for the future and beef it up too, 256 GB ram?)
  * Leave the 6x2TB disk space (for backup)
    * 2U, 8 drives, up to 6x4=24 TB, possible?
  * Add an entry-level Infiniband/Lustre solution
    * for parallel file locking

  * Spare parts
    * 8 port switch, HCAs and cables, drives ...
    * or get 5 years total warranty
  
  * Testing notes
    * Amber, LAMMPS, NAMD
    * CUDA v4 & v5
    * install/config dirs
    * use gnu ... with openmpi
    * make deviceQuery (see the sketch below)
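
For reference, the CUDA samples' deviceQuery (built with make deviceQuery) prints a long report; a stripped-down sketch covering just the properties the notes above care about (compute capability for double precision support, memory size, ECC) could look like the following. The file name and build line are illustrative.

<code c>
// mini_query.cu -- deviceQuery-style sketch limited to the properties that
// matter here: compute capability (double precision support), global
// memory, multiprocessor count, ECC.
//
// Build (illustrative): nvcc -o mini_query mini_query.cu

#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        printf("no CUDA devices found\n");
        return 1;
    }
    for (int d = 0; d < n; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        printf("GPU %d: %s\n", d, p.name);
        printf("  compute capability : %d.%d\n", p.major, p.minor);
        printf("  multiprocessors    : %d\n", p.multiProcessorCount);
        printf("  global memory      : %.1f GB\n",
               p.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        printf("  ECC enabled        : %s\n", p.ECCEnabled ? "yes" : "no");
    }
    return 0;
}
</code>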
\\
**[[cluster:0|Back]]**