\\
**[[cluster:0|Back]]**

==== Overview ====

With the second NSF proposal in a "recommended for funding" state, I'm preparing this page so we can address some looming issues and make decisions on our potential new acquisition. In general, these are the main topics:

  * Data Center/ITS Items
    * Increased cooling requirements
    * More electric L6-30 connectors
    * 192.168.1.xxx subnet
    * FTE requirements
  * Cluster/Committee Items
    * Increase home directory storage
    * More InfiniBand nodes
    * More nodes with a small memory footprint
    * Database server
  * New ventures
    * Should we investigate [[http://www.platform.com/private-cloud-computing/hpc-cloud|Platform ISF]]? A private cloud for HPC that runs both Windows and Linux compute nodes in adaptive mode (that is, in a virtualized environment).

The sections below present some detail on these points to get the discussion started.

==== Tada Moment! ====

April 27th, 2010 Update: So I had this bright idea, spurred on by the following. The private networks on both clusters have a /24 netmask that yields 254 usable IPs per subnet, and the clusters are currently using about 175 of them. With a new cluster potentially needing 512 more, we have a netmask and routing problem. It could get very messy.

Light bulb! If we get a large cluster, why not make it the primary one? For example, we could run CentOS/Kusu/Lava on it, or Red Hat/OCS/LSF (costs involved), and widen the netmask to something that yields far more IPs, such as 255.255.0.0. Once it is running, we ingest the petaltail/swallowtail and sharptail clusters into it. That would be much cleaner, easier to do, and far more flexible. We could then logically segment the physical hardware into "clusters".

==== Data Center ====

The breaker box panel currently provides 16 L6-30 connectors and can provide 4 more. We need to assess how much of that capacity is currently drawn, and then determine whether a new subpanel can provide enough connectors for the new cluster.

The cooling requirements of the new cluster will probably be met once the data center's environmental work project is finished. We will probably still be in the scenario that if a cooling unit dies, part or all of the cluster would need to be shut down to prevent overheating of the entire data center.

Unfortunately, the netmask of the Dell cluster was set to 255.255.255.0, which limits the 192.168.1.xxx subnet to 254 usable IPs. After the switches and both clusters, about 75 IPs are available in this range. If the new cluster comes with a few multi-core nodes, this may not be a problem. However, if many IPs are needed, the new cluster would have to reside in another subnet, and the routing tables would have to change if the clusters need to communicate with each other (as in an expansion of the Dell cluster). The sketch below illustrates the address math.
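To make the address math concrete, here is a minimal sketch using Python's standard ipaddress module. The 192.168.0.0/16 prefix is only an illustration of a 255.255.0.0 netmask (we have not picked an actual range), and the in-use/needed counts are the rough figures quoted above.

<code python>
# Minimal sketch: compare usable host counts for the current /24 private
# subnet versus a widened /16, and check whether the projected demand fits.
# Counts are the rough figures from the notes above (~175 IPs in use,
# up to 512 more needed); the /16 prefix is only an example.
import ipaddress

current  = ipaddress.ip_network("192.168.1.0/24")
proposed = ipaddress.ip_network("192.168.0.0/16")   # netmask 255.255.0.0

in_use = 175          # rough count across both existing clusters
new_need = 512        # worst-case need for the new cluster

for net in (current, proposed):
    usable = net.num_addresses - 2   # minus network and broadcast addresses
    print(f"{net} ({net.netmask}): {usable} usable hosts, "
          f"fits demand: {in_use + new_need <= usable}")

# Expected output:
# 192.168.1.0/24 (255.255.255.0): 254 usable hosts, fits demand: False
# 192.168.0.0/16 (255.255.0.0): 65534 usable hosts, fits demand: True
</code>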
Regarding the existing 0.5 FTE allocation, the new cluster would have a serious impact, especially if it introduces a new operating system (for example x86 Solaris) and a new scheduler (for example Grid Engine or PBS). This would also impact the SCIC for additional documentation and user support.

==== Cluster/Committee Items ====

Home directories are currently served from a NetApp filer via NFS (5 TB). This disk space was bought from ITS along with the Dell cluster (sidebar: I think this is correct). Since the filer provides a lot of functionality that we do not use, we could entertain the idea of giving this space back. We would then configure the new cluster to house a 25-50 TB disk array serving the home directories via NFS. This would require a UPS for the disk array, or putting it on the Enterprise UPS. It would also require all nodes on all clusters to be able to communicate with each other, unless we kept the NetApp space and added the new disk array to the new cluster only.

If the new cluster expands the existing Dell cluster, we would not need another head/login node, which may allow for more compute nodes. However, no more than 75 nodes could be added before we run into the routing problems described above.

All quotes obtained with the proposal put every compute node on both InfiniBand and Ethernet switches. This may be the simplest way to go configuration-wise, but the many additional compute nodes with a small memory footprint probably do not need InfiniBand connectivity.

The proposal, if I recall correctly, includes the addition of a database server. The specific requirements this database server must meet are unknown, and none of the quotes contain one. It should probably be a MySQL server to be compatible with any commercial software purchased in the future. It would also require a UPS and the ability of all compute nodes to reach it. Potentially, we could turn host swallowtail into a database server by adding disks and memory.

Another item of concern: if we move away from RHEL to, for example, Solaris, all the software needs to be recompiled. The new cluster will by far be the most heavily used, and recompiling the software base on petaltail was an enormous task. A migration to CentOS, on the other hand, is not a huge task, as demonstrated by sharptail.

==== Comparison of Quotes ====

^ Item ^ Advanced Clustering ^ Dell ^ Sun ^
| job slots | 512 (reduce to 352 for $200K) | 160 | 272 |
| compute nodes | blades | rack server | rack server |
| node count | 64 (reduce to 44 for $200K) | 20 | 34 |
| chip type | dual quad-core E5550 | dual quad-core E5420 | dual quad-core AMD 2376 |
| chip speed | 2.26 GHz | 2.5 GHz | 2.3 GHz |
| memory | 12 GB | 8 GB | 8 GB |
| hdd | 250 GB | 2x73 GB | 250 GB |
| infiniband | all | all | all |
| login node | dual quad-core E5520, 12 GB, 2x500 GB hdd | none | dual quad-core AMD 2380, 8 GB, 4x146 GB hdd |
| storage | Pinnacle Flexible Storage System, 12x2TB=24TB @7.2K RPM, disk shelf expandable to 32 TB | NX4 NAS dual blade, 12x1TB=24TB @7.2K RPM | Sun Fire 4550, 16x4GB=64GB mem, 48x1TB=48TB @7.2K RPM |
| raid | SATA, 6 | SATA, ? | SATA, ZFS |
| storage costs | $10K, $0.41/GB | $52K, $2.12/GB | $55K, $1.12/GB |
| storage functions | multiple NICs, optional 10Gb and InfiniBand support, LVs > 2 TB, snapshots, quota, antivirus scanning, CIFS/NFS/AFP, multiple raids (incl 5&6), cluster expandable | multiple NICs, optional 10Gb, snapshot (view), optional replication, CIFS/NFS/FTP, multiple raids (incl 5&6), expandable to 60 TB with 4 disk shelves, deduplication, file level compression | multiple NICs, ZFS with RAID-Z, unlimited scalability |
| OS | CentOS 5.x | RHEL 5.x | Solaris 10 |
| software | Intel C++, Fortran, MPI, MKL | | |
| scheduler | gridengine | Platform LSF (20) | gridengine |
| management | Beo Utils, Ganglia | Ganglia, NTop, Cacti | xVM |
| UPS | 3000VA, 2U | | |
| iKVM | yes | | |
| support | 3 years | 3 years | 3 years |
| L6-30 | 13 | estimated 4 | estimated 8 |
| cooling | 7.75 tons | | |
| weight | 3,000 lbs | | |
| cost to run w/o AC | $22,463 ($0.09/kWh) | | |
| nr of racks | 2x42U | 1x42U | 1x42U |
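The storage cost-per-GB figures in the table above are easy to double-check. Here is a quick sketch against the raw capacities; treating 1 TB as 1000 GB and ignoring RAID overhead are my assumptions, since the quotes do not state how the $/GB was derived.

<code python>
# Quick check of the quoted storage $/GB figures (raw capacity, no RAID
# overhead subtracted). Prices and capacities are from the table above.
quotes = {
    "Advanced Clustering": (10_000, 24),   # ($, raw TB)
    "Dell":                (52_000, 24),
    "Sun":                 (55_000, 48),
}

for vendor, (price, tb) in quotes.items():
    per_gb = price / (tb * 1000)           # 1 TB treated as 1000 GB
    print(f"{vendor:20s} ${per_gb:.2f}/GB")

# Gives roughly $0.42, $2.17 and $1.15 per GB versus the quoted $0.41,
# $2.12 and $1.12; the small differences suggest the vendors used
# slightly different capacity figures.
</code>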
==== Alternate Storage Options ====

Instead of adopting a vendor's suggested storage device, we could insert one of the devices listed below. This would require a server in front of the device, which would then provide the file systems to all nodes via NFS. However, the vendors' own storage devices are capable of native NFS, so going this route essentially means adding that server ourselves. All solutions need a UPS, or could potentially be powered by the ITS enterprise UPS. Also, as the size increases, we probably cannot back up to the VTL anymore. In that case, the snapshot capability is important (point-in-time restore). We could also design an "archive" area on these devices (RAID 1, for example), or perform disk-to-disk backup copies, still on the same device.

^ Item ^ Nexsan ^ RAID Inc ^ Nexsan ^
| name | SATABeast | Xanadu | SATABeast |
| controllers | dual | dual | dual |
| cache | 2 GB per controller | 2 GB per controller | 2 GB per controller |
| size | 42x1TB=42TB @7.2K RPM SATA | 84x1TB=84TB @7.2K RPM SATA | 14x2TB=28TB @7.2K RPM SATA |
| cost/GB | $1.14/GB | $0.96/GB | $1.74/GB |
| protocols | iSCSI/FC | iSCSI/FC | iSCSI/FC |
| raid 1,5&6 | yes | yes | yes |
| features | multiple volumes, multiple raids, expandable, no snapshot license | dynamic pooling, optional snapshots, expandable | multiple volumes, multiple raids, expandable, no snapshot license |
| front-end required? | yes | yes | yes |
| form factor | 4U | 12U | 4U |
| support | 3 years | 3 years | 3 years |

If we obtained, for example, the Nexsan 48 TB (raw) block-level device, we could entertain some new ideas: create a 10 TB Lustre file system for fast scratch space, create a 5-10 TB data archive to avoid duplicating data, enlarge home directory space to 10-15 TB, and perform D2D backup.
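To see whether those ideas all fit on one device, here is a rough capacity budget. This is a sketch only: the 48 TB raw figure comes from the paragraph above, while the RAID-6 layout (12-drive groups, two parity drives per group) and the decimal TB treatment are my assumptions, not anything from the quotes.

<code python>
# Rough capacity budget for carving up a single ~48 TB (raw) block device.
# Assumptions (not from the quotes): RAID-6 groups of 12 x 1 TB drives,
# i.e. 2 parity drives per group, TB treated decimally.
raw_tb = 48
drives_per_group = 12
parity_per_group = 2

groups = raw_tb // drives_per_group                           # 4 groups of 12 x 1 TB
usable_tb = groups * (drives_per_group - parity_per_group)    # 40 TB usable

plan = {
    "Lustre scratch":   10,
    "data archive":     10,   # upper end of the 5-10 TB range
    "home directories": 15,   # upper end of the 10-15 TB range
}
allocated = sum(plan.values())

print(f"usable after RAID-6: {usable_tb} TB")
print(f"allocated: {allocated} TB, left for D2D backup: {usable_tb - allocated} TB")
# -> usable after RAID-6: 40 TB
# -> allocated: 35 TB, left for D2D backup: 5 TB
</code>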
==== Round 2 of Quotes ====

^ Item ^ Advanced Clustering //updated// ^ Dell ^ Dell #2 //updated// ^ HP //updated!// ^
| job slots | 240 | 80 | 240 | 256 |
| overall cost/job slot (minus head node & storage) | $550 | $892 | $569 | $415 |
| compute nodes | blades | blades | blades | blades |
| node count | 30 | 10 | 30 | 32 |
| chip type | dual quad-core E5620 w/12MB cache "Westmere" | dual quad-core E5620 w/12MB cache "Westmere" | dual quad-core E5620 w/12MB cache "Westmere" | dual quad-core E5620 w/12MB cache "Westmere" |
| chip speed | 2.40 GHz | 2.40 GHz | 2.40 GHz | 2.40 GHz |
| memory | 6x2=12 GB | 6x2=12 GB | 6x2=12 GB | 12x1=12 GB |
| hdd | 1x250 GB | 1x146 GB (15K SAS) | 1x146 GB (15K SAS) | 1x160 GB |
| ethernet | 2 x Netgear GSM7352 v2 - 48-port, **10Gb switch to storage** | two PowerConnect 6248 - 48 port | two PowerConnect 6248 - 48 port, **10Gb switch to storage** | ProCurve Switch 2610-48 ($750), HP ProCurve 2910-48 ($3,500), both **1Gb** |
| infiniband | 72-Port 4X Configuration Modular DDR InfiniBand Switch, all nodes, includes head node and storage devices for IPoIB (see next section) | (3?) 12PT 4X DDR InfiniBand, all nodes | (3?) 12PT 4X DDR InfiniBand, all nodes | Voltaire IB 4X **QDR** 36P, all nodes, plus head node for IPoIB |
| head node | dual quad-core Xeon E5620 2.40GHz w/12MB, 6x2=12 GB, 2x500 GB hdd (rack) | Xeon E5620 2.4GHz, 12M cache, 6x2=12 GB, 2(?)x146 GB hdd (blade) | Xeon E5620 2.4GHz, 12M cache, 6x2=12 GB, 2(?)x146 GB hdd (blade) | HP E5620 dual quad-core DL380 G7, 6x2=12 GB, 2x250 GB (rack) |
| storage | Pinnacle Flexible Storage System, single Intel Xeon Quad Core X3360 2.83GHz w/12MB cache, 8 GB RAM | PV MD3200i, 6x PV MD1200, direct attach | NX4 10Gb NAS (EMC2) | HP StorageWorks MSA60 Array, **direct attach** |
| storage size | 24x2TB=48TB @7.2K RPM, expandable to 64 TB | 7x (12x600GB=7.2TB)=50 TB @15K, expandable to 57.6 TB | 2x(12x2TB)=48TB @7.2K RPM, expandable | 48x1TB=48TB @7.2K RPM, 4 trays total, not expandable |
| disk, raid | SATA, 6 | SAS, 5/6 | SATA, 5/6 | SATA, 5/6 |
| storage costs | $15K, $0.31/GB | $72.5K, $1.42/GB | $52K, $1.06/GB | $45K, $0.92/GB |
| storage functions | multiple NICs, optional 10Gb and InfiniBand support, LVs > 2 TB, **snapshot imaging enabled**, quota, antivirus scanning, CIFS/NFS/AFP, multiple raids (incl 5&6), cluster expandable | dual controller, optional snapshots, direct attach iSCSI, head node performs NFS | CIFS/NFS/FTP + iSCSI and Fibre, expandable to 96 TB, **snapview licensed**, deduplication capable | multiple NICs (on head node), direct attach, head node performs NFS duties via IPoIB, not expandable, **no snapshotting** |
| storage mgt software | yes | ? | yes | yes |
| OS | CentOS 5.x | RHEL 5.3 | RHEL 5.3 | RHEL 5.3 |
| software | Intel C++, Fortran, MPI, MKL | none | Dell | none |
| scheduler | gridengine | Platform LSF | Platform LSF | gridengine |
| management | Breakin, Cloner, Beo Utils, Act Dir, Ganglia | Platform OCS 5.x | Platform OCS 5.x | HP Cluster Mgt Utility license and media |
| UPS | 3000VA UPS | 5600W, 4U, 208V | 5600W, 4U, 208V | **none** |
| iKVM | yes | yes | no | yes |
| support | 3 years, NBD | 3 years, 4-hour 7x24 on-site service | 3 years, NBD | 3 years, NBD |
| L6-30 | 3 | | 2(?) | 4 |
| Watts | 12,891 | | 13,943 | 10,602 |
| BTU/hr | 43,984 | | 47,616 | 36,175 |
| A/C tons | 3.67 | | 4 | 3 |
| weight lbs | 1,829.9 | | | 1,006 |
| cost to run (9.36c/kWh) w/o A/C | $10,897.28 | | $11,797.12 | $8,952.57 (Watts+AC: 24% greener than Dell - saves $5,700/year, 18% greener than ACT - saves $4,000/year) |
| Us used per rack | 40/42U | ??/48U | | 33/42U |
| Note1 | unlimited lifetime technical support for all hardware and software we supply | SAS drives 600 GB @15K are $750 vs 500 GB @7.2K SATA at $275; the estimate is that we could save $40K | NX4 has more protocols than we need, but so what, $1/GB is good | arrives fully integrated, with knowledge transfer/training for a week on all parts of the cluster |
| Note2 | this solution could do IPoIB, presumably yielding an NFS performance boost, however we desire native NFS, see next section | Or could it be that 15K SAS drives in the array are a good idea? | UPS and iKVM are gone, no big deal | IPoIB is the unknown; we could do NFS on the head node, but would that then be a bottleneck? |
| Note3 | all MPI flavors precompiled with GNU, Portland (for AMD), Intel (for Xeon) | Platform OCS and LSF add costs, support is nice though | Total cost for OCS/LSF/RHEL is about $20K, which sets up a differential with ACT of $4.5K now | Unsure about the lack of management/OS software preinstalled/preconfigured; what if we hit a driver problem with HP hardware? The unit will be fully integrated to our specs with whatever we want (CentOS or RHEL/Gridengine) |
| Note4 | $75K buys another 135 job slots, 17 nodes | $75K buys another 80 job slots, 10 nodes | there is a second NX4 expansion shelf quoted which is not needed, could reduce the quote by $5K | $75K buys another 180 job slots, 22 nodes |
| Note5 | with these large hard disks on the compute nodes, investigate whether the /localscratch partitions can be presented as one scratch file system via Lustre, i.e. 200 GB x 30 = 6 TB; definitely worth the effort ||||
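The power, cooling, and run-cost rows above follow from the quoted wattages, so here is a small sketch of the conversions. The 3.412 BTU/hr per watt and 12,000 BTU/hr per ton factors and the 80% continuous-load rule on a 208 V L6-30 circuit are standard engineering values I am assuming, not vendor numbers.

<code python>
# Sketch of the power/cooling conversions behind the table above.
# Wattage figures are from the quotes; conversion factors and the
# 80% derating on a 208 V / 30 A (L6-30) circuit are assumptions.
import math

RATE = 0.0936                  # $/kWh, from the table heading
CIRCUIT_W = 208 * 30 * 0.8     # ~5 kW continuous per L6-30 circuit

quotes = {"ACT": 12_891, "Dell #2": 13_943, "HP": 10_602}   # watts

for vendor, watts in quotes.items():
    btu_hr = watts * 3.412
    tons = btu_hr / 12_000
    yearly = watts / 1000 * 8760 * RATE          # straight 24x365 draw
    min_circuits = math.ceil(watts / CIRCUIT_W)  # lower bound only
    print(f"{vendor:8s} {btu_hr:8.0f} BTU/hr  {tons:4.1f} tons  "
          f"~${yearly:,.0f}/yr  >= {min_circuits} x L6-30")

# The BTU/hr and tonnage values reproduce the table. The straight kWh
# calculation lands a few percent below the quoted run costs, so the
# vendors may have folded in a small overhead factor, and the quoted
# L6-30 counts also depend on PDU/chassis layout, so treat the circuit
# count here as a floor, not the final answer.
</code>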
==== IPoIB ====

^ Type ^ Bandwidth (Gbps) ^ ~Latency (us) ^ Price NIC+Port ($) ^
| Gigabit Ethernet | 1 | 40 | 40 |
| 10 Gigabit Ethernet | 10 | 40 | 1,350 |
| Infiniband 4x SDR | 8 | 4 | 600 |
| Infiniband 8x SDR | 16 | 4 | 720 |
| Infiniband 16x SDR | 32 | 4 | 1,200 |

  * For point-to-point throughput, IP over InfiniBand (Connected Mode) is comparable to native InfiniBand
  * When the disk is the bottleneck, NFS cannot benefit from IPoIB
  * When the disk is not the bottleneck, NFS benefits significantly from IPoIB

**HP states:**

//[HP Engineer] That DL380 has a P812 SAS controller, which has support for 6Gb SAS, and four 4x connectors to the MDS60 JBODs, and the RAID controller has eight 6Gb/s SAS physical links. That's a heavy-weight controller (with a 1GB cache), and the 8x PCIe bus (~4GB/s) and InfiniBand (~4GB/s) will be able to keep up with whatever the storage can provide. While the drives are listed as SAS, at that density, they are SATA internal with a SAS interface. As a rule of thumb, you might get as much as 100MB/s per drive, but realistically under normal workload (and across all parts of the platters, not just the fast outer edges), it's safer to assume ~50MB/s. If you implement RAID6 with one parity and one spare, that leaves 10 drives per MSA, or around 500MB/s per. So, it's possible that the storage could provide up to 2GB/s. To me, that seems like a very high number, which would require an xfs local file system to see the local bandwidth. And I've not seen a Linux NFS file system generate more than around 600-700MB/s per server, so when we layer that on top, the storage won't be the limit. I think your question was "where will the bottleneck be?", and I'd speculate that it's the NFS file system. If you are being asked to commit to a specific performance number, then I believe we will need to run actual benchmarks on your configuration because the software environment is subject to a lot of variability.//

//[Barry] The performance you can expect on IPoIB is between 200mb/s and 300mb/s, based on the information provided in the HP HPC engineering team's string of emails.//
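As a sanity check on the HP engineer's estimate above, here is a small sketch that simply redoes the arithmetic. The drive count per MSA60 shelf, the one-parity-one-spare layout, the ~50 MB/s per-drive rate, the four shelves, and the ~600-700 MB/s Linux NFS ceiling all come from the quote and the engineer's note; nothing here is measured.

<code python>
# Reconstruct the HP engineer's back-of-the-envelope storage estimate:
# 12 drives per MSA60 shelf, one parity + one spare leaving 10 data
# drives, ~50 MB/s sustained per drive, 4 shelves, and a Linux NFS
# server topping out around 600-700 MB/s.
drives_per_msa = 12
data_drives = drives_per_msa - 2        # one parity + one spare per shelf
per_drive_mb_s = 50                     # conservative sustained rate
shelves = 4

local_mb_s = data_drives * per_drive_mb_s * shelves   # what the disks can feed
nfs_ceiling_mb_s = 650                                # middle of 600-700 MB/s

print(f"raw storage bandwidth: ~{local_mb_s} MB/s ({local_mb_s/1000:.1f} GB/s)")
print(f"NFS server ceiling:    ~{nfs_ceiling_mb_s} MB/s")
print("bottleneck:", "NFS server" if nfs_ceiling_mb_s < local_mb_s else "disks")
# -> raw storage bandwidth: ~2000 MB/s (2.0 GB/s)
# -> NFS server ceiling:    ~650 MB/s
# -> bottleneck: NFS server
</code>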
**Advanced Clustering states:**

//We have used IPoIB on several occasions with great success. Using IPoIB should take away any network bottleneck and leave the drives as the bottleneck. With NFS over IPoIB you will have a theoretical 10Gb link to your NFS storage, which should still leave plenty of bandwidth for your calculations.//

However, their engineer warns that IPoIB and the FSS software might be unstable. The FSS shelf would therefore be replaced with another expansion shelf presented as a block device to the head node, which would then become an NFS server; only then could IPoIB be deployed cluster-wide. Not an attractive option.

//Attached is the updated quote with 10Gb connectivity on the primary storage node, without IPoIB but with 10Gb. The switch we are using has one 10Gb port ($152,283.77, or an additional $1,070 over the quote detailed above). This means traffic between the storage node and the switch is 10Gb, and between the switch and the nodes is 1Gb.//

==== ACT Questions ====

  - What size was the cluster (storage, nodes)?
    * WSU: 48 nodes, 384 cores, 5 TB, all InfiniBand, management over Ethernet
    * UCLA: 64 ACT nodes, 512 cores, added to a 600-node cluster, all InfiniBand, 100 TB storage
    * FSU: top-500 sized; public and owner-owned clusters; two ACT clusters of 30 & 10 nodes; enterprise Panasas storage, 196 TB
    * KTZT: 9 nodes, 40 TB, in two purchases
  - What OS, scheduler? Familiarity with that in-house?
    * WSU: Torque
    * UCLA: CentOS, Sun gridengine (goes commercial via Oracle?)
    * FSU: CentOS, Rocks, MOAB
    * KTZT: CentOS, Sun gridengine
  - Experiences in the setup and final configuration?
    * WSU: ACT sent folks for the 1st cluster, the second cluster was easy
    * UCLA: very experienced in-house, wipes everything delivered
    * FSU: wipes all hardware
    * KTZT: on-site visit ($1,900), useful, no prior experience
  - Was everything the way it was quoted?
    * WSU: yes
    * UCLA: yes
    * FSU: yes
    * KTZT: yes
  - What software is primarily used?
    * WSU: Amber, Gaussian
    * UCLA: all kinds, very similar to us, chem/physics/engineering/bio
    * FSU: all kinds
    * KTZT: NOAA, climate modeling
  - What does the cluster user base look like? Curriculum usage?
    * WSU: few active accounts (<10), single faculty member + post docs
    * UCLA: 350 users across many departments, primarily grads/faculty, no class usage
    * FSU: hundreds of users
    * KTZT: one faculty, four students
  - How many administrators are involved? IT staff or students?
    * WSU: a post doc spends about 15% of their time on system administration
    * UCLA: many IT persons; estimates that the ACT cluster we have spec'ed requires one experienced person
    * FSU: large IT department
    * KTZT: faculty
  - Experiences with their support, hardware and software related?
    * WSU: ACT support has been excellent, prompt replacement parts, solid phone support; considers this the best part of their operation
    * UCLA: IT does not use it, but the department's folks do; when involved in those cases, ACT support is very knowledgeable and quick, with the same person on all contacts/tickets (same as Platform)
    * FSU: none, except support, rated excellent, very knowledgeable
    * KTZT: beyond believable, excellent, grand, both hardware and software
  - Any items you would have done differently with them?
    * WSU: none
    * UCLA: be attentive to quote details, for example InfiniBand SDR vs DDR
    * FSU: NA
    * KTZT: storage is slow, probably NFS related; get more memory per CPU

==== HP Questions ====

**UMICH**
  * Dell complacent, HP proactive (buys $1M in 3 years, 2,852 cores)
  * No experience with the software
  * Good HP support experience
  * HP more efficient than Dell, and supports that claim
  * Torque/PBS shop
  * No IPoIB experience (yet)
  * Also a Dell shop

==== Other Thoughts ====

If we purchased just one rack via ACT or Dell (HP is unlikely at this time) and reserved $25K for the database server, we could still invest in the old Dell rack. For instance:

  * collapse ehw and ehwfd
    * buy 2 more MD1000s with even larger 15K disks
    * makes a single queue of 8 nodes
  * collapse elw and emw
    * universal 16 GB memory footprint per node
    * makes a single queue of 12 nodes
  * imw
    * double memory to 16 GB/node

This would use refurbished materials from Dell, and I'm not sure it is worth the effort given the unsupported hardware. The better investment would be to increase the size of the "primary" new cluster, that is, go with Advanced Clustering. But it is worth investigating.

\\
**[[cluster:0|Back]]**