**[[cluster:
+ | |||
+ | ===== Notes ===== | ||
+ | |||
+ | * HP cluster off support 11/30/2013 | ||
+ | * We need greentail/ | ||
+ | * Karen added to budget, Dave to approve ($2200/ | ||
+ | * We need another disk array | ||
+ | * For robust D2D backup | ||
+ | * Pressed HP Procurve ethernet backup switch into production | ||
+ | * Dell Force 10 switch failing or traffic overwhelmed it | ||
+ | * Need a file server away from the login node | ||
+ | * We need a new cluster with support | ||
+ | * power consumption versus computational power | ||
+ | * gpu versus cpu | ||
+ | * 6 of 36 dell compute nodes have failed | ||
===== GPU Specs =====

===== Round 3 =====

==== Specs: MW - GPU ====

This is what we ended up buying in May 2013.

^ Topic ^ Description ^
| General| 10 CPUs (80 cores), 20 GPUs (45,000 cuda cores), 256 gb ram/node (1,280 gb total), plus head node (128 gb)|
| Head Node|1x42U Rackmount System (36 drive bays), 2xXeon E5-2660 2.0 Ghz 20MB Cache 8 cores (total 16 cores)|
| |16x16GB 240-Pin DDR3 1600 MHz ECC (total 256gb, max 512gb), ? |
| |2x1TB 7200RPM (Raid 1) + 16x3TB (Raid 6), Areca Raid Controller|
| |Low profile graphics card, ConnectX-3 VPI adapter card, Single-Port, |
| |1400w Power Supply 1+1 redundant|
| Nodes|5x 2U Rackmountable Chassis, 5x 2 Xeon E5-2660 2.0 Ghz 20MB Cache 8 cores (16 cores/node)|
| |5x 16x16GB 240-Pin DDR3 1600 MHz (256gb/node memory, max 256gb)|
| |5x 1x120GB SSD 7200RPM, 5x 4xNVIDIA Tesla K20 5 GB GPUs (4/node), 1CPU-2GPU ratio|
| |? |
| |5xConnectX-3 VPI adapter card, Single-Port, |
| |5x1620W 1+1 Redundant Power Supplies|
| Network|1x 1U Mellanox InfiniBand QDR Switch (18 ports) & HCAs (single port) + 3m cable QDR to existing Voltaire switch|
| |1x 1U 24 Port Rackmount Switch, 10/ |
| Rack |1x42U rack with power distributions (14U used)|
| Power|2xPDU, |
| Software| CentOS, Bright Cluster Management (1 year support), MVAPICH, OpenMPI, CUDA|
| | Scheduler and GNU compilers installed and configured|
| | Amber12 (customer provides license), LAMMPS, NAMD, CUDA 4.2 (for apps) & 5 |
| Warranty|3 Year Parts and Labor (lifetime technical support)|
| GPU Teraflops|23.40 double, 70.40 single|
| Quote|< |
| Includes |
+ | |||
+ | |||
+ | * 16U - estimated draw 6,900 Watts and 23,713 BTUs cooling - $30K/year | ||
+ | * 5 GPU shelves | ||
+ | * 2 PDUs | ||
+ | * 42 TB raw | ||
+ | * FDR interconnects | ||
+ | * 120GB SSD drives on nodes | ||
+ | * 256 gb ram on nodes, 16gb/core | ||
+ | * Areca hardware raid | ||
+ | * Lifetime technical support | ||
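
A quick sanity check on the draw/cooling bullet and the teraflop row: the aggregate GPU figures are just 20 cards times NVIDIA's peak per-card ratings for the Tesla K20 (about 1.17 TF double, 3.52 TF single; those per-card values are assumptions, not part of the quote), and the BTU figure is roughly the electrical draw times 3.412 BTU/hr per watt. A minimal sketch:

<code python>
# Sanity check for the MW GPU quote: aggregate peak TFLOPS and cooling load.
# Per-card K20 peak ratings are assumptions, not figures from the quote itself.
K20_DOUBLE_TF = 1.17   # peak double-precision teraflops per Tesla K20
K20_SINGLE_TF = 3.52   # peak single-precision teraflops per Tesla K20
GPUS = 20

print(round(GPUS * K20_DOUBLE_TF, 2))   # 23.4  -> matches "23.40 double"
print(round(GPUS * K20_SINGLE_TF, 2))   # 70.4  -> matches "70.40 single"

# Electrical draw to cooling load: 1 watt is about 3.412 BTU/hr.
WATTS = 6900
print(round(WATTS * 3.412))             # ~23,543 BTU/hr, near the quoted 23,713
</code>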
+ | |||
+ | ==== Specs: EC GPU ==== | ||
+ | |||
+ | |||
+ | ^ Topic^Description | ||
+ | | General| 12 CPUs (96 cores), 20 GPUs (45,000 cuda cores), 128 gb ram/node (640 gb total), plus head node (128gb)| | ||
+ | | Head Node|1x2U Rackmount System, 2xXeon E5-2660 2.20 Ghz 20MB Cache 8 cores| | ||
+ | | |8x16GB 240-Pin DDR3 1600 MHz ECC (128gb, max 512gb), 2x10/ | ||
+ | | |2x2TB RAID1 7200RPM (can hold 10), ConnectX-2 VPI adapter card, Single-Port, | ||
+ | | |1920w Power Supply, redundant| | ||
+ | | Nodes|6x2U Rackmountable Chassis, 6x2 Xeon E5-2660 2.20 Ghz 20MB Cache 8 cores (16/node), Sandy Bridge series| | ||
+ | | |48x16GB 240-Pin DDR3 1600 MHz (128gb/node memory, 8gb/core, max 256gb)| | ||
+ | | |6x1TB 7200RPM, 5x4xNVIDIA Tesla K20 8 GB GPUs (4/node), 1CPU-2GPU ratio| | ||
+ | | |2x10/ | ||
+ | | |6xConnectX-2 VPI adapter card, Single-Port, | ||
+ | | |6x1800W Redundant Power Supplies| | ||
+ | | Network|1x Mellanox InfiniBand QDR Switch (18 ports)& HCAs (single port) +9x7' cables (2 uplink cables)| | ||
+ | | |1x 1U 16 Port Rackmount Switch, 10/ | ||
+ | | Rack & Power|42U, 4xPDU, Basic, 1U, 30A, 208V, (10) C13, Requires 1x L6-30 Power Outlet Per PDU| | ||
+ | | Software| CentOS, Bright Cluster Management (1 year support)| | ||
+ | | | Amber12 (cluster install), Lammps (shared filesystem), | ||
+ | | Storage|3U 52TB Disk Array (28x2TB) Raid 6, cascade cable| | ||
+ | | Warranty|3 Year Parts and Labor (EC technical support? | ||
+ | | GPU Teraflops|23.40 double, 70.40 single| | ||
+ | | Quote|< | ||
+ | |||
+ | |||
+ | * 20U - estimated draw 7,400 Watts - $30K/year for cooling and power | ||
+ | * 5 GPU shelves | ||
+ | * 1 CPU shelf | ||
+ | * 4 PDU - this could be a problem! | ||
+ | * 56TB raw | ||
+ | * QDR interconnects | ||
+ | * 1 TB disk on node, makes for a large / | ||
+ | * LSI hardware raid card | ||
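
For context, the $30K/year figure is roughly consistent with running 7,400 W around the clock if cooling approximately doubles the electrical load and power costs on the order of $0.23/kWh; both the cooling factor and the rate are assumptions, not numbers from the quote. A rough sketch:

<code python>
# Rough annual power-and-cooling cost estimate for the EC configuration.
# The cooling overhead factor and electricity rate below are assumptions.
draw_kw = 7.4           # quoted estimated draw (7,400 Watts)
cooling_factor = 2.0    # assume cooling roughly doubles the electrical load
rate_per_kwh = 0.23     # assumed all-in $/kWh

kwh_per_year = draw_kw * cooling_factor * 24 * 365
print(round(kwh_per_year * rate_per_kwh))   # ~29,800 -> in line with "$30K/year"
</code>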
+ | |||
===== Round 2 ===== | ===== Round 2 ===== | ||

==== Specs: MW - CPU ====

^ Topic ^ Description ^
| General|13 nodes, 26 CPUs (208 cores), 128 gb ram/node (total 1,664 gb), plus head node (256gb)|
| Head Node|1x4U Rackmount System (36 drive bays), 2xXeon E5-2660 2.0 Ghz 20MB Cache 8 cores (total 16 cores)|
| |16x16GB 240-Pin DDR3 1600 MHz ECC (total 256gb, max 512gb), ? |
| |2x1TB 7200RPM (Raid 1) + 16x3TB (Raid 6), Areca Raid Controller|
| |Low profile graphics card, ConnectX-3 VPI adapter card, Single-Port, |
| |1400w Power Supply 1+1 redundant|
| Nodes|13x 2U Rackmountable Chassis, 13x 2 Xeon E5-2660 2.0 Ghz 20MB Cache 8 cores (16 cores/node)|
| |13x 8x16GB 240-Pin DDR3 1600 MHz (128gb/node memory, max 256gb)|
| |13x 1x120GB SSD 7200RPM |
| |? |
| |13xConnectX-3 VPI adapter card, Single-Port, |
| |13x600W non-redundant Power Supplies|
| Network|1x 1U Mellanox InfiniBand QDR Switch (18 ports) & HCAs (single port) + 3m cable QDR to existing Voltaire switch|
| |1x 1U 24 Port Rackmount Switch, 10/ |
| Rack |1x42U rack with power distributions (14U used)|
| Power|2xPDU, |
| Software| CentOS, Bright Cluster Management (1 year support), MVAPICH, OpenMPI, CUDA|
| | Scheduler and GNU compilers installed and configured|
| | Amber12 (customer provides license), LAMMPS, NAMD, CUDA 4.2 (for apps) & 5 |
| Warranty|3 Year Parts and Labor (lifetime technical support)|
| Quote|< |
| Includes |
+ | |||
+ | |||
+ | * 5,250 Watts and 17,913 BTUs/Hour | ||
+ | * infiniband switch (18 port needed for IPoIB) and ethernet switch (24 port) | ||
+ | * sandy bridge chip E2660 and larger memory footprint (128gb node, 256gb head node) | ||
+ | * 120GB SSD drives on nodes | ||
+ | * storage: 42TB usable Raid 6 | ||
+ | * Lifetime technical support | ||
+ | * Drop software install ($3.5K savings) | ||
+ | |||
+ | * Spare parts | ||
+ | * ? | ||
+ | * Expand Storage | ||
+ | * upgrade to 56TB usable Raid 6 ($5.3K using 16x4TB disks) | ||
+ | * upgrade to 90TB usable Raid 60 ($10.3K using 34x3TB disks) | ||
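
The usable capacities above follow from RAID 6 giving up two disks' worth of space per parity group (and RAID 60 two per sub-group); a minimal check, assuming the 34-disk RAID 60 option is built as two 17-disk RAID 6 groups:

<code python>
# Usable capacity check for the storage options.
# RAID 6 loses 2 disks of capacity per group; RAID 60 loses 2 per sub-group.
def raid6_usable(disks, tb_per_disk):
    return (disks - 2) * tb_per_disk

def raid60_usable(disks, tb_per_disk, groups):
    # assumes the disks split evenly into 'groups' RAID 6 sets
    return (disks // groups - 2) * groups * tb_per_disk

print(raid6_usable(16, 3))       # 42 TB -> "42TB usable Raid 6"
print(raid6_usable(16, 4))       # 56 TB -> "56TB usable Raid 6" upgrade
print(raid60_usable(34, 3, 2))   # 90 TB -> "90TB usable Raid 60" upgrade
</code>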
+ | |||
+ | * Alternate storage: | ||
+ | * add storage server of 2.4 TB Usable 15K fast speed SAS disk ($9K-1K of 4U chassis) | ||
+ | * leave 18TB local storage on head node | ||
| Quote|< |
  * 16TB Raid6 storage (14 TB usable - tight for /home)
  * full height rack
| Quote|< |
  * 16TB Raid6 storage (14 TB usable - tight for /home)
  * 1TB on nodes is wasted (unless we make fast local /