cluster:107 [2013/01/28 15:52] hmeij

  * Experimental setup with 36 gb/node, dual 8 core chips
  * Nothing larger than that memory-wise, as CPU and GPU HPC work environments were not mixed
  * No raw code development
  * Speedups were hard to tell
  * PGI Accelerator was used because
  * Double precision was most important in scientific applications
  * MPI flavor was OpenMPI, and others (including MVAPICH) showed no advantages
  * Book: Programming Massively Parallel Processors, Second Edition: A Hands-on Approach by David B. Kirk and Wen-mei W. Hwu (Dec 28, 2012)
    * Has examples of how to expose GPUs across nodes

==== ConfCall & Quote: AC ====

09nov12:

  * /home and /apps mounted on CPU side. How does the GPU access these? Or is the job on the CPU responsible for this?
  * Single versus double precision? Both needed, I assume.
  * Unit above is Nvidia "
  * Lava compatibility (almost certain, but need to check); AC uses SGE.
  * We do not really "
  * Intel Xeon Phi Co-Processors:
  * <

**AC Quote**

  * Early 2013 product line up\\
  * [[http://
  * Quote coming for single 4U unit, which could be a one-off test unit (compare to HP)

^ Topic ^ Description ^
| General | 2 CPUs (16 cores), 3 GPUs (7,500 cuda cores), 32 gb ram/node |
| Head Node | None |
| Nodes | 1x4U Rackmountable Chassis, 2xXeon E5-2660 2.20 GHz 20MB Cache 8 cores (16cores/ |
| | 8x4GB 240-Pin DDR3 1600 MHz memory (32gb/ |
| | 1x120GB SATA 2.5" Solid State Drive (OS drive), 7x3TB 7200RPM |
| | 3xNVIDIA Tesla K20 8 GB GPUs (3/node), 1CPU-1.5GPU ratio |
| | 2x10/ |
| | 1xConnectX-3 VPI adapter card, single-port 56Gb/s |
| | 2x1620W Redundant Power Supplies |
| Network | 1x36 port Infiniband FDR (56Gb/s) switch & 4xConnectX-3 single port FDR (56Gb/s) IB adapters |
| Power | Rack power ready |
| Software | None |
| Warranty | 3 Year Parts and Labor (AC technical support) |
| GPU Teraflops | 3.51 double, 10.56 single |
| Quote | < |

  * In order to match the "
  * 8100 Watts, would still fit power-wise but not rack-wise (we'd need 20U)
  * Single rack, 21 TB of disk space (RAID 5/6)
  * The IB switch (plus 4 spare cards/
  * If we remove it, we need QDR Voltaire-compliant HCAs and cables (3 ports free)
  * The config does not pack as many teraflops for the dollars; we'll see
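The "GPU Teraflops" rows in these quotes follow directly from the Tesla K20's published per-card peak ratings (about 1.17 TF double, 3.52 TF single precision); each vendor just multiplies by GPU count. A minimal sketch to sanity-check the rows:

```python
# Sanity check of the "GPU Teraflops" table rows.
# Per-card figures are NVIDIA's published Tesla K20 peak ratings,
# which these quotes evidently use.
K20_DOUBLE_TF = 1.17
K20_SINGLE_TF = 3.52

def gpu_teraflops(n_gpus):
    """Peak (double, single) precision teraflops for n Tesla K20 cards."""
    return (round(n_gpus * K20_DOUBLE_TF, 2),
            round(n_gpus * K20_SINGLE_TF, 2))

# AC quote: 3 GPUs
print(gpu_teraflops(3))   # -> (3.51, 10.56), matching the table above
```

The same function reproduces the EC (16 GPUs: 18.72/56.32), HP (18 GPUs: 21.06/63.36), and AX (12 GPUs: 14.04 double) rows below.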


==== ConfCall & Quote: EC ====

12nov12:

  * GPU hardware only
  * scheduler never sees GPUs, just CPUs
  * CPU-to-GPU is one-to-one when using Westmere chips
  * Bright Cluster Management (image based) - we can front-end with Lava
  * what's the memory connection CPU/GPU???
  * home dirs - cascade via Voltaire 4036; need to make sure this is compatible!
  * software on local disk? home dirs via InfiniBand IPoIB, yes, but self-install
  * Amber (charge for this) and LAMMPS preinstalled - must be no problem, will be confirmed
  * 2 K20s per 2 CPUs per rack, 900-1000W; 1200W power supply on each node
  * PDU on simcluster; each node has a power connection
  * quote coming for 4-node simcluster
  * testing periods can be staged so you are testing exactly what we're buying, if simcluster is within budget (see K20 above)

**EC Quote**

  * [[http://

^ Topic ^ Description ^
| General | 8 CPUs (64 cores), 16 GPUs (40,000 cuda cores), 64 gb ram/node, plus head node |
| Head Node | 1x1U Rackmount System, 2xXeon E5-2660 2.20 GHz 20MB Cache 8 cores |
| | 8x8GB 240-Pin DDR3 1600 MHz ECC (max 256gb), 2x10/ |
| | 2x2TB 7200RPM (can hold 10), ConnectX-2 VPI adapter card, Single-Port, |
| | 600w Power Supply |
| Nodes | 4x2U Rackmountable Chassis, 8xXeon E5-2660 2.20 GHz 20MB Cache 8 cores (16/node), Romley series |
| | 32x8GB 240-Pin DDR3 1600 MHz (64gb/node memory, 16gb/gpu, max 256gb) |
| | 4x1TB 7200RPM, 16xNVIDIA Tesla K20 8 GB GPUs (4/node), 1CPU-2GPU ratio |
| | 2x10/ |
| | 4xConnectX-2 VPI adapter card, Single-Port, |
| | 4x1800W Redundant Power Supplies |
| Network | 1x Mellanox InfiniBand QDR Switch (8 ports) & HCAs (single port) + 7' cables |
| | 1x 1U 16 Port Rackmount Switch, 10/ |
| Power | 2xPDU, |
| Software | CentOS, Bright Cluster Management (1 year support) |
| | Amber12 (cluster install), Lammps (shared filesystem), |
| Warranty | 3 Year Parts and Labor (EC technical support? |
| GPU Teraflops | 18.72 double, 56.32 single |
| Quote | < |

  * Let's make this the "
  * In order to match this with Xeon Phis we'd need 18 of them (probably 5 4U trays)
  * This is the (newest) simcluster design (can be tested starting Jan 2013)
  * 24U cabinet
  * We could deprecate 50% of the bss24 queue, freeing two L6-30 connectors
  * Spare parts:
    * Add another HCA card to greentail and connect to the Mellanox switch (long cable)
      * also isolates GPU traffic from other clusters
    * 1 8-port switch, 4 HCA cards, 4 long cables (for petal/
  * New head node
    * First let EC install Bright/
  * 16 GPUs implies 16x2,500 or 40,000 cuda cores (625 per job slot on average)
  * Use as standalone cluster or move GPU queue to greentail
    * If so, turn this head node into a 16 job slot ram-heavy compute node?
      * 256-512gb (Order?)
      * add local storage? (up to 10 1-or-2 TB disks)
  * Compute nodes
    * add local storage? (up to 10 1-or-2 TB disks)
  * Bright supports openlava and GPU monitoring (get installed)
    * [[http://
    * [[http://
  * EC software install
    * sander, sander.MPI, pmemd, pmemd.cuda (single GPU version), pmemd.cuda.MPI (the multi-GPU version)
    * NVIDIA Toolkit v4.2. Please note that v5.0 is NOT currently supported
    * MVAPICH2 v1.8 or later / MPICH2 v1.4p1 or later recommended,
    * make sure they do not clean source; analyze how they compiled
    * which compiler will they use? which MPI <
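The "625 per job slot" figure above is just the total CUDA core count spread over the CPU job slots (the notes round the K20's 2,496 cores to 2,500). A quick check, assuming one job slot per CPU core:

```python
# CUDA cores available per scheduler job slot for the EC config.
# Assumes one job slot per CPU core; 2,500 cores/K20 is the notes'
# rounding of the actual 2,496.
gpus = 16
cuda_cores_per_gpu = 2500
cpu_cores = 8 * 8                       # 8 CPUs x 8 cores = 64 job slots

total_cuda = gpus * cuda_cores_per_gpu  # 40,000
per_slot = total_cuda // cpu_cores      # 625
print(total_cuda, per_slot)             # 40000 625
```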

==== ConfCall & Quote: HP ====

HP 19nov12: meeting notes

  * HP ProLiant SL270s Generation 8 (Gen8); 4U half-width with 2 CPUs + 8 (max) GPUs
  * The s6500 chassis is a 4U tray holding two SL270s servers
  * max 8 GPUs (20,000 cuda cores) + 2 CPUs (total 16 cores), dual drives, 256gb max
  * K20 availability will be confirmed by Charlie
  * power
  * Charlie will crunch numbers on the existing HPC and assess if we can use the current rack
    * otherwise a stand-alone half-rack solution
  * <
  * connect greentail with additional HCA card, or Voltaire to Voltaire?
  * our software compilation problem, huge
    * but they have great connections with Nvidia for compilation help (how to qualify that?)
  * CMU for GPU monitoring, 3D rendering of what the GPU is doing
  * This SL270s can also support up to 8 Xeon Phi coprocessors
    * but expect very lengthy delays; Intel is not ready for delivery (1 Phi = 1 double teraflop)

**HP Quote**

[[http://

  * <
  * This hardware can be tested at ExxactCorp, so a single-tray purchase for testing is not a requirement
  * 2 chassis in 8U + 4 SL250s, each with 8 GPUs, would be a massive GPU cruncher
    * 8 CPUs, 32 GPUs = 64 cpu cores and 80,000 cuda cores (avg 1,
    * peak performance:
  * 1 chassis in 4U + 2 SL250s, each with 8 GPUs, would be the "

^ Topic ^ Description ^
| General | 6 CPUs (total 48 cores), 18 GPUs (45,000 cuda cores), 64 gb ram/node, no head node |
| Head Node | None |
| Chassis | 2xs6500 Chassis (4U) can each hold 2 half-width SL250s (gen8, |
| Nodes | 3xSL250s (gen8), |
| | 3x16x8GB 240-Pin DDR3 1600 MHz (64gb/node, 10+ gb/gpu, max 256gb) |
| | 3x2x500GB 7200RPM, 3x6xNVIDIA Tesla K20 5 GB GPUs (6 gpu/node), 1CPU-to-3GPU ratio |
| | 3x2x10/ |
| | 3x2xIB interconnect, |
| | chassis-supplied power; 3x1x one PDU power cord (416151-B21)? |
| Network | 1xVoltaire QDR 36-port infiniband 40 Gb/s switch, + 6x 5M QSFP IB cables |
| | No ethernet switch, 17x 7' CAT5 RJ45 cables |
| Power | rack PDU ready; what is 1x HP 40A HV Core Only Corded PDU??? |
| Software | RHEL, CMU GPU enabled (1 year support) - not on quote??? |
| Warranty | 3 Year Parts and Labor (HP technical support? |
| GPU Teraflops | 21.06 double, 63.36 single |
| Quote | < |

  * To compare with "benchmark option" price-wise: 37% higher (25% fewer CPU cores)
  * To compare with "benchmark option" performance:
  * When the quote is reduced to 1x s6500 chassis and 2x SL250s:
    * To compare with "benchmark option" price-wise: 1.6% higher (50% fewer CPU cores)
    * To compare with "benchmark option" performance:
  * HP on-site install
  * we have 9U in the HP rack available (1U for new switch)
  * L6-30 7,500 Watts x3 PDUs (non-UPS) = 22,500 Watts - HP cluster 10,600 Watts
    * leaves 11,898 Watts, should be sufficient for 4 SL270s (redundant power supplies)
  * new infiniband switch isolates GPU cluster traffic from rest of HPC
    * 36 port IB switch overkill
    * still need IB connection greentail to new switch (home dirs IPoIB)
  * 1 TB local storage per node
  * our software install problem, so is the 12.5% worth it? (with 3 trays)
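The PDU headroom bullet above can be sketched directly. Note the rounded 10,600 W draw gives 11,900 W of headroom; the notes' 11,898 W figure implies an unrounded draw of 10,602 W.

```python
# Power headroom on the existing three L6-30 PDUs (non-UPS).
pdus = 3
watts_per_pdu = 7500        # per L6-30 circuit, as listed in the notes
hp_cluster_draw = 10600     # rounded figure from the notes

headroom = pdus * watts_per_pdu - hp_cluster_draw
print(headroom)             # 11900 W with the rounded draw
                            # (notes say 11,898 W, i.e. a 10,602 W draw)
```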

==== ConfCall & Quote: AX ====

  * Cluster management is ROCKS (we'll pass)
  * No scheduler (that'
  * They do **not** install software, only the operating system and
  * CUDA driver setup and installation

**AX Quote**

[[http://

^ Topic ^ Description ^
| General | 8 CPUs (48 cores), 12 GPUs (30,000 cuda cores), 64 gb ram/node, plus head node |
| Head Node | 1x1U Rackmount System, 2x Intel Xeon E5-2620 2.0GHz (12 cores total) |
| | 64GB DDR3 1333MHz (max 256gb), 2x10/ |
| | 2x1TB (Raid 1) 7200RPM, InfiniBand |
| | ???w Power Supply, CentOS |
| Nodes | 4x1U, 4x2xIntel Xeon E5-2650 2.0GHz, with 6 cores (12cores/ |
| | 4x96GB 240-Pin DDR3 1600 MHz (96gb/node memory, 8gb/gpu, max 256gb) |
| | 4x1TB 7200RPM, 12xNVIDIA Tesla K20 8 GB GPUs (3/node), 1CPU-1.5GPU ratio |
| | 2x10/ |
| | 4xInfiniband adapter card, Single-Port, |
| | 4x??00W Redundant Power Supplies |
| Network | 1x Infiniband Switch (18 ports) & HCAs (single port) + ?' cables |
| | 1x 1U 24 Port Rackmount Switch, 10/ |
| Power | there are 3 rack PDUs? What are the connectors, L6-30? |
| Software | CUDA only |
| Warranty | 3 Year Parts and Labor (AX technical support? |
| GPU Teraflops | 14.04 double, |
| Quote | < |

  * 22U cabinet
  * Insurance during shipping is our problem (non-returnable)
  * To compare with "
  * To compare with "
  * If we go turnkey systems, having software installed is huge

==== ConfCall & Quote: MW ====

  * sells both individual racks and turn-key systems
  * racks are 4U with 2 CPUs and 8 GPUs, 2200 Watts, K20X GPUs
  * turn-key units are per customer specifications
  * they will install **all** software components (if license keys are provided)
    * includes CUDA drivers and setup, Amber (pmemd.cuda & pmemd.cuda.MPI,
    * but also Matlab and Mathematica if needed (wow!)
  * standard 2-year warranty though (no biggie)

**MW Quote**

[[http://

^ Topic ^ Description ^
| General | 8 CPUs (64 cores), 16 GPUs (40,000 cuda cores), 32 gb ram/node, plus head node |
| Head Node | 1x2U Rackmount System, 2xXeon E5-2650 2.0 GHz 20MB Cache 8 cores |
| | 8x4GB 240-Pin DDR3 1600 MHz ECC (max 512gb), 2x10/ |
| | 2x1TB 7200RPM (Raid 1) + 6x2TB (Raid 6), Areca Raid Controller |
| | Low-profile graphics card, ConnectX-3 VPI adapter card, Single-Port, |
| | 740w Power Supply, 1+1 redundant |
| Nodes | 4x1U Rackmountable Chassis, 4x2 Xeon E5-2650 2.0 GHz 20MB Cache 8 cores (16/node), Sandy Bridge series |
| | 4x8x4GB 240-Pin DDR3 1600 MHz (32gb/node memory, 8gb/gpu, max 256gb) |
| | 4x1x120GB SSD, 4x4xNVIDIA Tesla K20 5 GB GPUs (4/node), 1CPU-2GPU ratio |
| | 2x10/ |
| | 4xConnectX-3 VPI adapter card, Single-Port, |
| | 4x1800W (non) Redundant Power Supplies |
| Network | 1x Mellanox InfiniBand FDR Switch (36 ports) & HCAs (single port) + 3m cable FDR to existing Voltaire switch |
| | 1x 1U 48 Port Rackmount Switch, 10/ |
| Rack | 1x42U rack with power distribution |
| Power | 2xPDU, |
| Software | CentOS, Bright Cluster Management (1 year support), MVAPICH, OpenMPI, CUDA 5 |
| | scheduler and gnu compilers installed and configured |
| | Amber12, Lammps, Barracuda (for weirlab?), and others if desired ... bought through MW |
| Warranty | 3 Year Parts and Labor (lifetime technical support) |
| GPU Teraflops | 18.72 double, 56.32 single |
| Quote | < |
| Upgrades | |
| | 5x2 E5-2660 2.20 GHz 8 core CPUs |
| | 5x upgrade to 64 GB per node |

  * At full load 5,900 Watts and 20,131 BTUs/
  * 2% more expensive than "
    * But a new rack (advantageous for data center)
    * With lifetime technical support
    * solid state drives on compute nodes
    * 12 TB local storage
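The BTU figure truncated above follows from the standard conversion of about 3.412 BTU/hr per watt of IT load; a quick check:

```python
# Convert the MW config's full-load electrical draw to cooling load.
# 1 W of IT load ~= 3.412 BTU/hr of heat to remove.
watts = 5900
btu_per_hr = watts * 3.412
print(round(btu_per_hr))    # 20131 BTU/hr, matching the notes
```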

Then

  * 36-port FDR switch: replace with 8-port QDR switch for savings (40 vs 56 Gbps)
    * and all server adapter cards to QDR (with one hook-up to existing Voltaire switch)
  * Expand memory footprint
    * Go to 124 GB memory/node to beef up the CPU HPC side of things
    * 16 cpu cores/node minus 4 cpu/gpu cores/node = 12 cpu cores using 104gb, which is about 8 GB/cpu core
  * Online testing available (K20, do this)
    * then decide on PGI compiler at purchase time
    * maybe all Lapack libraries too
  * Make the head node a compute node (in/for the future, and beef it up too, 256 GB ram?)
    * Leave the 6x2TB disk space (for backup)
    * 2U, 8 drives, up to 6x4=24 TB, possible?
  * Add an entry-level Infiniband/
    * for parallel file locking
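The memory-per-core bullet above works out as follows. A sketch, assuming 4 of the 16 cores per node are reserved for driving the 4 GPUs (the notes allot the remaining cores 104 GB of the 124 GB total):

```python
# Memory per CPU-only core if nodes are upgraded to 124 GB.
node_mem_gb = 124
cpu_only_cores = 16 - 4     # 4 cores assumed dedicated to the 4 GPUs
cpu_mem_gb = 104            # per the notes; remainder serves the GPU cores

per_core = cpu_mem_gb / cpu_only_cores
print(round(per_core, 2))   # 8.67 GB/core ("about 8 GB" in the notes)
```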

  * Spare parts
    * 8-port switch, HCAs and cables, drives ...
    * or get 5 years total warranty

\\
**[[cluster: