**[[cluster:
+ | |||
+ | ===== Notes ===== | ||
+ | |||
+ | * HP cluster off support 11/30/2013 | ||
+ | * We need greentail/ | ||
+ | * Karen added to budget, Dave to approve ($2200/ | ||
+ | * We need another disk array | ||
+ | * For robust D2D backup | ||
+ | * Pressed HP Procurve ethernet backup switch into production | ||
+ | * Dell Force 10 switch failing or traffic overwhelmed it | ||
+ | * Need a file server away from the login node | ||
+ | * We need a new cluster with support | ||
+ | * power consumption versus computational power | ||
+ | * gpu versus cpu | ||
+ | * 6 of 36 dell compute nodes have failed | ||
===== GPU Specs =====

===== Round 3 =====

==== Specs: MW - GPU ====

This is what we ended up buying in May 2013.

^ Topic ^ Description ^
| General| 10 CPUs (80 cores), 20 GPUs (45,000 cuda cores), 256 gb ram/node (1,280 gb total), plus head node (128 gb)|
| Head Node|1x42U Rackmount System (36 drive bays), 2xXeon E5-2660 2.0 Ghz 20MB Cache 8 cores (total 16 cores)|
| |16x16GB 240-Pin DDR3 1600 MHz ECC (total 256gb, max 512gb), ? |
| |2x1TB 7200RPM (Raid 1) + 16x3TB (Raid 6), Areca Raid Controller|
| |Low profile graphics card, ConnectX-3 VPI adapter card, Single-Port, |
| |1400w Power Supply 1+1 redundant|
| Nodes|5x 2U Rackmountable Chassis, 5x 2 Xeon E5-2660 2.0 Ghz 20MB Cache 8 cores (16 cores/node)|
| |5x 16x16GB 240-Pin DDR3 1600 MHz (256gb/node memory, max 256gb)|
| |5x 1x120GB SSD 7200RPM, 5x 4xNVIDIA Tesla K20 5 GB GPUs (4/node), 1CPU-2GPU ratio|
| |? |
| |5xConnectX-3 VPI adapter card, Single-Port, |
| |5x1620W 1+1 Redundant Power Supplies|
| Network|1x 1U Mellanox InfiniBand QDR Switch (18 ports) & HCAs (single port) + 3m cable QDR to existing Voltaire switch|
| |1x 1U 24 Port Rackmount Switch, 10/ |
| Rack |1x42U rack with power distributions (14U used)|
| Power|2xPDU, |
| Software| CentOS, Bright Cluster Management (1 year support), MVAPICH, OpenMPI, CUDA|
| | Scheduler and GNU compilers installed and configured|
| | Amber12 (customer provides license), LAMMPS, NAMD, CUDA 4.2 (for apps) & 5 |
| Warranty|3 Year Parts and Labor (lifetime technical support)|
| GPU Teraflops|23.40 double, 70.40 single|
| Quote|< |
| Includes |
+ | |||
+ | |||
+ | * 16U - estimated draw 6,900 Watts and 23,713 BTUs cooling - $30K/year | ||
+ | * 5 GPU shelves | ||
+ | * 2 PDUs | ||
+ | * 42 TB raw | ||
+ | * FDR interconnects | ||
+ | * 120GB SSD drives on nodes | ||
+ | * 256 gb ram on nodes, 16gb/core | ||
+ | * Areca hardware raid | ||
+ | * Lifetime technical support | ||
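
A quick sanity check on the draw/cooling bullet and the teraflop row: the aggregate GPU figures are just 20 cards times NVIDIA's peak per-card ratings for the Tesla K20 (about 1.17 TF double, 3.52 TF single; those per-card values are assumptions, not part of the quote), and the BTU figure is roughly the electrical draw times 3.412 BTU/hr per watt. A minimal sketch:

<code python>
# Sanity check for the MW GPU quote: aggregate peak TFLOPS and cooling load.
# Per-card K20 peak ratings are assumptions, not figures from the quote itself.
K20_DOUBLE_TF = 1.17   # peak double-precision teraflops per Tesla K20
K20_SINGLE_TF = 3.52   # peak single-precision teraflops per Tesla K20
GPUS = 20

print(round(GPUS * K20_DOUBLE_TF, 2))   # 23.4  -> matches "23.40 double"
print(round(GPUS * K20_SINGLE_TF, 2))   # 70.4  -> matches "70.40 single"

# Electrical draw to cooling load: 1 watt is about 3.412 BTU/hr.
WATTS = 6900
print(round(WATTS * 3.412))             # ~23,543 BTU/hr, near the quoted 23,713
</code>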
+ | |||
+ | ==== Specs: EC GPU ==== | ||
+ | |||
+ | |||
+ | ^ Topic^Description | ||
+ | | General| 12 CPUs (96 cores), 20 GPUs (45,000 cuda cores), 128 gb ram/node (640 gb total), plus head node (128gb)| | ||
+ | | Head Node|1x2U Rackmount System, 2xXeon E5-2660 2.20 Ghz 20MB Cache 8 cores| | ||
+ | | |8x16GB 240-Pin DDR3 1600 MHz ECC (128gb, max 512gb), 2x10/ | ||
+ | | |2x2TB RAID1 7200RPM (can hold 10), ConnectX-2 VPI adapter card, Single-Port, | ||
+ | | |1920w Power Supply, redundant| | ||
+ | | Nodes|6x2U Rackmountable Chassis, 6x2 Xeon E5-2660 2.20 Ghz 20MB Cache 8 cores (16/node), Sandy Bridge series| | ||
+ | | |48x16GB 240-Pin DDR3 1600 MHz (128gb/node memory, 8gb/core, max 256gb)| | ||
+ | | |6x1TB 7200RPM, 5x4xNVIDIA Tesla K20 8 GB GPUs (4/node), 1CPU-2GPU ratio| | ||
+ | | |2x10/ | ||
+ | | |6xConnectX-2 VPI adapter card, Single-Port, | ||
+ | | |6x1800W Redundant Power Supplies| | ||
+ | | Network|1x Mellanox InfiniBand QDR Switch (18 ports)& HCAs (single port) +9x7' cables (2 uplink cables)| | ||
+ | | |1x 1U 16 Port Rackmount Switch, 10/ | ||
+ | | Rack & Power|42U, 4xPDU, Basic, 1U, 30A, 208V, (10) C13, Requires 1x L6-30 Power Outlet Per PDU| | ||
+ | | Software| CentOS, Bright Cluster Management (1 year support)| | ||
+ | | | Amber12 (cluster install), Lammps (shared filesystem), | ||
+ | | Storage|3U 52TB Disk Array (28x2TB) Raid 6, cascade cable| | ||
+ | | Warranty|3 Year Parts and Labor (EC technical support? | ||
+ | | GPU Teraflops|23.40 double, 70.40 single| | ||
+ | | Quote|< | ||
+ | |||
+ | |||
+ | * 20U - estimated draw 7,400 Watts - $30K/year for cooling and power | ||
+ | * 5 GPU shelves | ||
+ | * 1 CPU shelf | ||
+ | * 4 PDU - this could be a problem! | ||
+ | * 56TB raw | ||
+ | * QDR interconnects | ||
+ | * 1 TB disk on node, makes for a large / | ||
+ | * LSI hardware raid card | ||
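
For context, the $30K/year figure is roughly consistent with running 7,400 W around the clock if cooling approximately doubles the electrical load and power costs on the order of $0.23/kWh; both the cooling factor and the rate are assumptions, not numbers from the quote. A rough sketch:

<code python>
# Rough annual power-and-cooling cost estimate for the EC configuration.
# The cooling overhead factor and electricity rate below are assumptions.
draw_kw = 7.4           # quoted estimated draw (7,400 Watts)
cooling_factor = 2.0    # assume cooling roughly doubles the electrical load
rate_per_kwh = 0.23     # assumed all-in $/kWh

kwh_per_year = draw_kw * cooling_factor * 24 * 365
print(round(kwh_per_year * rate_per_kwh))   # ~29,800 -> in line with "$30K/year"
</code>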
+ | |||
===== Round 2 ===== | ===== Round 2 ===== | ||

==== Specs: MW - CPU ====

^ Topic ^ Description ^
| General|13 nodes, 26 CPUs (208 cores), 128 gb ram/node (total 1,664 gb), plus head node (256gb)|
| Head Node|1x4U Rackmount System (36 drive bays), 2xXeon E5-2660 2.0 Ghz 20MB Cache 8 cores (total 16 cores)|
| |16x16GB 240-Pin DDR3 1600 MHz ECC (total 256gb, max 512gb), ? |
| |2x1TB 7200RPM (Raid 1) + 16x3TB (Raid 6), Areca Raid Controller|
| |Low profile graphics card, ConnectX-3 VPI adapter card, Single-Port, |
| |1400w Power Supply 1+1 redundant|
| Nodes|13x 2U Rackmountable Chassis, 13x 2 Xeon E5-2660 2.0 Ghz 20MB Cache 8 cores (16 cores/node)|
| |13x 8x16GB 240-Pin DDR3 1600 MHz (128gb/node memory, max 256gb)|
| |13x 1x120GB SSD 7200RPM |
| |? |
| |13xConnectX-3 VPI adapter card, Single-Port, |
| |13x600W non-redundant Power Supplies|
| Network|1x 1U Mellanox InfiniBand QDR Switch (18 ports) & HCAs (single port) + 3m cable QDR to existing Voltaire switch|
| |1x 1U 24 Port Rackmount Switch, 10/ |
| Rack |1x42U rack with power distributions (14U used)|
| Power|2xPDU, |
| Software| CentOS, Bright Cluster Management (1 year support), MVAPICH, OpenMPI, CUDA|
| | Scheduler and GNU compilers installed and configured|
| | Amber12 (customer provides license), LAMMPS, NAMD, CUDA 4.2 (for apps) & 5 |
| Warranty|3 Year Parts and Labor (lifetime technical support)|
| Quote|< |
| Includes |
+ | |||
+ | |||
+ | * 5,250 Watts and 17,913 BTUs/Hour | ||
+ | * infiniband switch (18 port needed for IPoIB) and ethernet switch (24 port) | ||
+ | * sandy bridge chip E2660 and larger memory footprint (128gb node, 256gb head node) | ||
+ | * 120GB SSD drives on nodes | ||
+ | * storage: 42TB usable Raid 6 | ||
+ | * Lifetime technical support | ||
+ | * Drop software install ($3.5K savings) | ||
+ | |||
+ | * Spare parts | ||
+ | * ? | ||
+ | * Expand Storage | ||
+ | * upgrade to 56TB usable Raid 6 ($5.3K using 16x4TB disks) | ||
+ | * upgrade to 90TB usable Raid 60 ($10.3K using 34x3TB disks) | ||
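
The usable capacities above follow from RAID 6 giving up two disks' worth of space per parity group (and RAID 60 two per sub-group); a minimal check, assuming the 34-disk RAID 60 option is built as two 17-disk RAID 6 groups:

<code python>
# Usable capacity check for the storage options.
# RAID 6 loses 2 disks of capacity per group; RAID 60 loses 2 per sub-group.
def raid6_usable(disks, tb_per_disk):
    return (disks - 2) * tb_per_disk

def raid60_usable(disks, tb_per_disk, groups):
    # assumes the disks split evenly into 'groups' RAID 6 sets
    return (disks // groups - 2) * groups * tb_per_disk

print(raid6_usable(16, 3))       # 42 TB -> "42TB usable Raid 6"
print(raid6_usable(16, 4))       # 56 TB -> "56TB usable Raid 6" upgrade
print(raid60_usable(34, 3, 2))   # 90 TB -> "90TB usable Raid 60" upgrade
</code>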
+ | |||
+ | * Alternate storage: | ||
+ | * add storage server of 2.4 TB Usable 15K fast speed SAS disk ($9K-1K of 4U chassis) | ||
+ | * leave 18TB local storage on head node | ||
| Quote|< |
  * 16TB Raid6 storage (14 TB usable - tight for /home)
  * full height rack
| Quote|< |
  * 16TB Raid6 storage (14 TB usable - tight for /home)
  * 1TB on nodes is wasted (unless we make fast local /