
Overview

With the second NSF proposal in a “recommended for funding” state, I'm preparing this page so we can address some looming issues and make decisions on our potential new acquisition. In general, these are the main topics:

The sections below present some detail on these points to get the discussion started.

Tada Moment!

April 27th, 2010 Update: So I had this bright idea, spurred on by the following. The private networks on both clusters have a netmask that yields 254 usable IPs per subnet. Currently the clusters are chewing up about 175 of those. Now, with a new cluster potentially needing 512 more, we have a netmask and routing problem. Could be very messy.

Light bulb! So if we got a huge cluster, why not make it the primary? For example, we could run CentOS/Kusu/Lava on it, or Red Hat/OCS/LSF (costs involved), and adjust the netmask to something yielding tons of IPs, like 255.255.0.0. Once it is running, we ingest the petaltail/swallowtail and sharptail clusters into it. Much cleaner and easier to do, and much more flexible. We can then logically segment the physical hardware into “clusters”.
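To put numbers on the netmask problem, here is a minimal sketch using Python's standard `ipaddress` module. The 175 and 512 figures are the rough estimates from the text above; the helper names are made up for illustration.

```python
import ipaddress

def usable_hosts(netmask: str) -> int:
    """Usable host addresses for a private network with the given netmask."""
    net = ipaddress.ip_network(f"192.168.0.0/{netmask}", strict=False)
    # Subtract the network and broadcast addresses.
    return net.num_addresses - 2

def smallest_prefix_for(hosts_needed: int) -> int:
    """Longest prefix length (smallest subnet) that still holds hosts_needed hosts."""
    for prefix in range(30, 0, -1):
        if 2 ** (32 - prefix) - 2 >= hosts_needed:
            return prefix
    raise ValueError("too many hosts for a single IPv4 subnet")

in_use = 175       # approximate current usage, from the text
new_cluster = 512  # worst-case need of the new cluster, from the text
needed = in_use + new_cluster

print(usable_hosts("255.255.255.0"))  # current netmask
print(usable_hosts("255.255.0.0"))    # proposed netmask
print(needed, smallest_prefix_for(needed))
```

Under these assumptions a /24 cannot hold everything, while 255.255.0.0 leaves plenty of headroom; anything from a /22 upward would already be enough for the combined count.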

Data Center

The breaker box panel currently provides 16 L6-30 connectors and can provide 4 more. We need to assess how much power is currently being drawn, and then determine whether a new subpanel can provide enough connectors for the new cluster.
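As a rough way to size this, the sketch below estimates how many L6-30 circuits a given load needs. It assumes 208 V circuits and the common 80% continuous-load derating, and it treats watts as volt-amps (power factor ignored); all of that should be checked with the electricians. The 12,891 W example load is the Advanced Clustering figure from the Round 2 quotes on this page.

```python
import math

VOLTS = 208              # assumption: 208 V feeds behind the L6-30 receptacles
AMPS = 30                # L6-30 receptacle rating
CONTINUOUS_FACTOR = 0.8  # assumption: 80% derating for continuous loads

def circuits_needed(load_watts: float) -> int:
    """Number of L6-30 circuits needed to carry load_watts continuously."""
    usable_watts_per_circuit = VOLTS * AMPS * CONTINUOUS_FACTOR
    return math.ceil(load_watts / usable_watts_per_circuit)

def fits_in_panel(load_watts: float, free_connectors: int = 4) -> bool:
    """True if the load fits on the connectors the panel can still provide."""
    return circuits_needed(load_watts) <= free_connectors

print(circuits_needed(12_891), fits_in_panel(12_891))
```

This ignores redundant power feeds, which vendors often quote for and which would increase the connector count.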

Cooling requirements of the new cluster will probably be met once the data center's environment work project is finished. We will probably still be in the scenario where, if a cooling unit dies, part or all of the cluster would need to be shut down to prevent the entire data center from overheating.

Unfortunately, the netmask of the Dell cluster was set to 255.255.255.0, resulting in 254 usable IPs in subnet 192.168.1.xxx. After switches and both clusters, about 75 IPs remain available in this range. If the new cluster comes with a few multi-core nodes, this may not be a problem. However, if many IPs are needed, the new cluster would reside in another subnet, and the routing tables would have to change if the clusters need to communicate with each other (as in an expansion of the Dell clusters).

Regarding the existing 0.5 FTE allocation, the new cluster would have a serious impact on it, especially if the new cluster introduces a new operating system (for example x86 Solaris) and a new scheduler (for example Grid Engine or PBS). This will also impact the SCIC, with additional documentation and user support needed.

Cluster/Committee Items

Home directories are currently served from a NetApp filer via NFS (5 TB). This disk space was bought with the Dell cluster from ITS (sidebar: I think this is correct). Since the filer provides lots of functionality that we do not use, we could entertain the idea of giving this space back. We would then configure the new cluster to house a 25-50 TB disk array serving up the home directories via NFS. This would require a UPS for the disk array, or we could put it on the enterprise UPS. It would also require all nodes on all clusters to communicate with each other, unless we kept the NetApp space and added the new disk array to the new cluster only.

If the new cluster expands the existing Dell cluster, we would not need another head/login node, which may allow for more compute nodes. No more than 75 nodes could be added before we run into routing problems.

All quotes obtained with the proposal put all compute nodes on both InfiniBand and Ethernet switches. This may be the simplest way to go configuration-wise. But the many additional compute nodes with a small memory footprint probably do not need InfiniBand connectivity.

The proposal, if I recall correctly, proposes the addition of a database server. The specifications this database server must meet are unknown. None of the quotes contain a database server. It should probably be a MySQL server, to be compatible with any commercial software purchased in the future. It would also require a UPS and the ability of all compute nodes to communicate with the database server. Potentially we could turn host swallowtail into a database server by adding disks and memory.

Another item of concern is that if we move away from RHEL to, for example, Solaris, all the software needs to be recompiled. The new cluster will be by far the most preferred. Recompiling the software base on petaltail was an enormous task. A migration to CentOS is not a huge task, as demonstrated by sharptail.

Comparison of Quotes

<!-- | quote | $ 246,237 | $ 202,175 | $ 199,528 | -->

Item Advanced Clustering Dell Sun
job slots 512 (reduce to 352 for $200K) 160 272
compute nodes blades rack server rack server
node count 64 (reduce to 44 for $200K) 20 34
chip type dual-quad E5550 dual-quad E5420 dual-quad AMD 2376
chip speed 2.26 ghz 2.5 ghz 2.3 ghz
memory 12 gb 8 gb 8 gb
hdd 250 gb 2×73 gb 250 gb
infiniband all all all
login node dual quad core E5520, 12 gb, 2×500 gb hdd none dual quad core AMD 2380, 8 gb, 4×146 gb hdd
storage Pinnacle Flexible Storage System, 12x2TB=24TB@7.2K RPM, disk shelf expandable to 32TB NX4 NAS dual blade, 12x1TB=24TB@7.2K RPM sunfire 4550, 16x4gb=64gb mem, 48x1TB=48TB@7.2K RPM
raid SATA, 6 SATA, ? SATA, ZFS
storage costs $10K, $0.41/GB $52K, $2.12/GB $55K, $1.12/GB
storage functions multiple NICs, optional 10gb and Infiniband support, LVs > 2 TB, snapshots, quota, antivirus scanning, CIFS/NFS/AFP, multiple raids (incl 5&6), cluster expandable multiple NICs, optional 10 gb, snapshot(view), optional replication, CIFS/NFS/FTP, multiple raids (incl 5&6), expandable to 60 TB with 4 disk shelves, deduplication, file level compression multiple NICs, ZFS with RaidZ, unlimited scalability
OS CentOS 5.x RHEL 5.x Solaris 10
software Intel C++, Fortran, MPI, MKL
scheduler gridengine platform LSF (20) gridengine
management Beo Utils, Ganglia Ganglia, NTop, Cacti Xvm
UPS 3000VA, 2U
iKVM yes
support 3 years 3 years 3 years
L6-30 13 estimated 4 estimated 8
cooling 7.75 tons
weight 3,000 lbs
cost to run w/o AC $22,463 ($0.09/kWh)
nr of racks 2x42U 1x42U 1x42U
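The per-GB storage figures in the table can be reproduced with a small calculation. This is a sketch under one assumption that is not stated on the page: that 1 TB was counted as 1024 GB, which is the convention that matches the quoted numbers.

```python
GB_PER_TB = 1024  # assumption: binary convention, it reproduces the table's figures

def cost_per_gb(storage_cost_usd: float, raw_tb: float) -> float:
    """Storage cost in dollars per raw GB."""
    return storage_cost_usd / (raw_tb * GB_PER_TB)

# (storage cost in USD, raw capacity in TB), taken from the table above
quotes = {
    "Advanced Clustering": (10_000, 24),
    "Dell": (52_000, 24),
    "Sun": (55_000, 48),
}

for vendor, (cost, tb) in quotes.items():
    print(f"{vendor}: ${cost_per_gb(cost, tb):.2f}/GB")
```

Note that this is cost per raw GB; usable capacity after RAID overhead would push every figure up.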

Alternate Storage Options

Instead of adopting the vendor's suggested storage device, we could insert one of the devices listed below. This would require a server in front of the device, which would then provide the file systems to all nodes via NFS. The vendors' storage devices, by contrast, are capable of native NFS, so this option essentially adds that server.

All solutions need a UPS, or could potentially be powered by the ITS enterprise UPS.

Also, as the size increases, we probably cannot back up to the VTL anymore. In that case, snapshot capability becomes important (point-in-time restore). We could also design an “archive” on these devices (RAID 1, for example), or perform disk-to-disk backup copies, still on the same device.

<!-- | quote | $ 48,862 | $ 81,750 | $ 49,892 | -->

| Item | Nexsan | RAID Inc | Nexsan |
| name | SataBeast | Xanadu | SataBeast |
| controllers | dual | dual | dual |
| cache | 2 GB per controller | 2 GB per controller | 2 GB per controller |
| size | 42x1TB=42TB @7.2K RPM SATA | 84x1TB=84TB @7.2K RPM SATA | 14x2TB=28TB @7.2K RPM SATA |
| cost/GB | $1.14/GB | $0.96/GB | $1.74/GB |
| protocols | iSCSI/FC | iSCSI/FC | iSCSI/FC |
| raid 1, 5 & 6 | yes | yes | yes |
| features | multiple volumes, multiple raids, expandable, no snapshot license | dynamic pooling, optional snapshots, expandable | multiple volumes, multiple raids, expandable, no snapshot license |
| front-end required? | yes | yes | yes |
| form factor | 4U | 12U | 4U |
| support | 3 years | 3 years | 3 years |

If we obtained, for example, the Nexsan 48 TB (raw) block-level device, we could entertain some new ideas. Create a 10 TB Lustre file system for fast scratch space. Create a 5-10 TB data archive to avoid duplicating data. Enlarge home directory space to 10-15 TB. Perform D2D backup.

Round 2 of Quotes

<!-- | quote | $ 152,283.77 (+$2K one day, on site) | $ 149,380.49 | $ 176,856.91 (on site included? check, perhaps +2.5K) | $ 149,996.49 (one week on site included) -->

Item Advanced Clustering updated Dell Dell #2 updated HP updated!
job slots 240 80 240 256
overall cost/job slot (minus head node & storage) $ 550 $ 892 $ 569 $ 415
compute nodes blades blades blades blades
node count 30 10 30 32
chip type dual-quad E5620w/12MBCache “Westmere” dual-quad E5620w/12MBCache “Westmere” dual-quad E5620w/12MBCache “Westmere” dual-quad E5620w/12MBCache “Westmere”
chip speed 2.40 ghz 2.40 ghz 2.40 ghz 2.40 ghz
memory 6×2=12 gb 6×2=12 gb 6×2=12 gb 12×1=12 gb
hdd 1×250 gb 1×146 gb (15k SAS) 1×146 gb (15k SAS) 1×160 gb
ethernet 2 x Netgear GSM7352 v2 - 48-port, 10Gb switch to storage two PowerConnect 6248 - 48 port two PowerConnect 6248 - 48, 10Gb switch to storage ProCurve Switch 2610-48 ($750), HP ProCurve 2910-48 ($3,500) both 1Gb
infiniband 72-Port 4X Configuration Modular DDR InfiniBand Switch, all nodes, includes head node and storage devices for IPoIB (see next section) (3?) 12PT 4X DDR-INFINIBAND, all nodes (3?) 12PT 4X DDR-INFINIBAND, all nodes Voltaire IB 4X QDR 36P, all nodes, plus head node for IPoIB
head node dual quad core Xeon E5620 2.40GHz w/ 12MB, 6×2=12 gb, 2×500 gb hdd (rack) XeonE5620 2.4Ghz, 12M Cache, 6×2=12 gb, 2(?)x146 gb hdd (blade) XeonE5620 2.4Ghz, 12M Cache, 6×2=12 gb, 2(?)x146 gb hdd (blade) HP E5620 dual-quad DL380G7, 6×2=12 gb, 2x250gb (rack)
storage Pinnacle Flexible Storage System, single Intel Xeon Quad Core X3360 2.83GHz w/ 12MB cache, 8 gb ram PV MD3200i, 6x PV MD1200, direct attach NX4 10Gb NAS (EMC2) HP StorageWorks MSA60 Array, direct attach
storage size 24x2TB=48TB @7.2K rpm, expandable to 64TB 7x (12x600gb=7.2TB)=50 TB @15K, expandable to 57.6 TB 2(12×2)=48TB @7.2K rpm, expandable 48x1TB=48TB @7.2K rpm, 4 trays total, not expandable
disk,raid SATA, 6 SAS, 5/6 SATA, 5/6 SATA, 5/6
storage costs $15K, $0.31/GB $72.5K, $1.42/GB $52K, $1.06/GB $45K, $0.92/GB
storage functions multiple NICs, optional 10gb and Infiniband support, LVs > 2 TB, snapshots imaging enabled, quota, antivirus scanning, CIFS/NFS/AFP, multiple raids (incl 5&6), cluster expandable dual controller, optional snapshots, direct attach iSCSI, head node performs NFS CIFS/NFS/FTP + iSCSI and Fiber, expandable to 96 TB, snapview licensed, deduplication capable multiple NICs (on head node), direct attach, head node performs NFS duties via IPoIB, not expandable, no snapshotting
storage mgt software yes ? yes yes
OS CentOS 5.x RHEL 5.3 RHEL 5.3 RHEL5.3
software Intel C++, Fortran, MPI, MKL none dell none
scheduler gridengine Platform LSF Platform LSF gridengine
management Breakin, Cloner, Beo Utils, Act Dir, Ganglia Platform OCS5.x Platform OCS5.x HP Cluster Mgt Utility Lic and Media
UPS 3000VA UPS 5600W, 4U, 208V 5600W, 4U, 208V none
iKVM yes yes no yes
support 3 years, NBD 3 years, 4-Hour 7×24 On-site Service 3 years, NBD 3 years, NBD
L6-30 3 2(?) 4
Watts 12,891 13,943 10,602
BTU/hr 43,984 47,616 36,175
A/C Tons 3.67 4 3
weight lbs 1829.9 1,006
cost to run (9.36c/kWh) w/o A/C $10,897.28 $11,797.12 $8,952.57 (Watts+AC: 24% greener than Dell - saves $5,700/year, 18% greener than ACT - saves $4,000/year)
Us Used per rack 40/42U ??/48 33/42U
Note1 unlimited lifetime technical support for all hardware and software we supply SAS drives 600 gb @15K is $750 vs 500 gb @7.2K SATA is $275, estimate is that could save $40K NX4 has more protocols than we need but so what, $1/GB is good arrives fully integrated, with knowledge transfer/training for a week on all parts of the cluster
Note2 this solution could do IPoIB, presumably yielding an NFS boost in performance, however desire native NFS, see next section Or it could be that 15K SAS drives in array is a good idea? UPS and iKVM are gone, no big deal IPoIB the unknown, we could do NFS on head node, but then a bottleneck?
Note3 all MPI flavors precompiled with Gnu, Portland (for AMD), Intel (for Xeon) Platform OCS and LSF add costs, support is nice though Total cost for OCS/LSF/RHEL is about $20K, that sets up a differential with ACT of $4.5K now Unsure about the lack of management/OS software preinstalled/preconfigured, what if we hit a driver problem with HP hardware? unit will be fully integrated to our specs with whatever we want (CentOS or RHEL/Gridengine)
Note4 $75K buys another 135 job slots, 17 nodes $75K buys another 80 job slots, 10 nodes there is a second NX4 expansion shelf quoted which is not needed, could reduce quote by $5K $75k buys another 180 job slots, 22 nodes
Note5 with these large hard disks on compute nodes, investigate the idea of whether /localscratch partitions can be presented as one scratch file system via Lustre, i.e. 200 GB × 30 = 6 TB, definitely worth the effort
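The BTU/hr and A/C tonnage rows follow from the Watts row by standard conversions (1 W is about 3.412 BTU/hr, and one ton of cooling is 12,000 BTU/hr). A minimal sketch, with made-up function names, that also reproduces the "greener" percentages quoted in the cost-to-run row:

```python
BTU_HR_PER_WATT = 3.412   # standard conversion: 1 W is about 3.412 BTU/hr
BTU_HR_PER_TON = 12_000   # one ton of cooling removes 12,000 BTU/hr

def cooling_load(watts: float) -> tuple[float, float]:
    """Return (BTU/hr, tons of A/C) needed to remove the heat from `watts`."""
    btu_hr = watts * BTU_HR_PER_WATT
    return btu_hr, btu_hr / BTU_HR_PER_TON

def percent_saved(watts: float, baseline_watts: float) -> float:
    """Percentage by which `watts` is lower than `baseline_watts`."""
    return 100 * (baseline_watts - watts) / baseline_watts

# Watts figures from the table above, labeled as in the cost-to-run row
for label, watts in {"ACT": 12_891, "Dell": 13_943, "HP": 10_602}.items():
    btu_hr, tons = cooling_load(watts)
    print(f"{label}: {btu_hr:,.0f} BTU/hr, {tons:.2f} tons")

print(f"HP vs Dell: {percent_saved(10_602, 13_943):.0f}% less power")
print(f"HP vs ACT:  {percent_saved(10_602, 12_891):.0f}% less power")
```

The Dell BTU/hr figure comes out slightly below the table's value, presumably because a marginally different conversion factor or rounding was used there.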

IPoIB

| Type | Bandwidth (Gbps) | ~Latency (us) | Price NIC+Port ($) |
| Gigabit Ethernet | 1 | 40 | 40 |
| 10 Gigabit Ethernet | 10 | 40 | 1,350 |
| Infiniband 4x SDR | 8 | 4 | 600 |
| Infiniband 8x SDR | 16 | 4 | 720 |
| Infiniband 16x SDR | 32 | 4 | 1,200 |
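One way to read the table above is price per unit of bandwidth. The sketch below just divides the NIC+port price by the bandwidth for each row; the numbers are copied from the table and the function name is illustrative.

```python
# name: (bandwidth in Gbps, approximate latency in microseconds, NIC+port price in USD)
links = {
    "Gigabit Ethernet": (1, 40, 40),
    "10 Gigabit Ethernet": (10, 40, 1350),
    "Infiniband 4x SDR": (8, 4, 600),
    "Infiniband 8x SDR": (16, 4, 720),
    "Infiniband 16x SDR": (32, 4, 1200),
}

def price_per_gbps(bandwidth_gbps: float, price_usd: float) -> float:
    """Dollars paid per Gbps of nominal bandwidth."""
    return price_usd / bandwidth_gbps

# Sort the link types from cheapest to most expensive per Gbps.
ranked = sorted(links, key=lambda name: price_per_gbps(links[name][0], links[name][2]))

for name in ranked:
    gbps, latency_us, price = links[name]
    print(f"{name}: ${price_per_gbps(gbps, price):.2f}/Gbps at ~{latency_us} us")
```

Per Gbps, the faster Infiniband options land in the same range as plain Gigabit Ethernet while offering roughly a tenth of the latency; 10 Gigabit Ethernet is the most expensive per Gbps in this table.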

HP states

[HP Engineer] That DL380 has a P812 SAS controller, which has support for 6Gb SAS, and four 4x connectors to the MDS60 JBODs, and the RAID controller has eight 6Gb/s SAS physical links. That's a heavy-weight controller (with a 1GB cache), and the 8x PCIe bus (~4GB/s) and InfiniBand (~4GB/s) will be able to keep up with whatever the storage can provide. While the drives are listed as SAS, at that density, they are SATA internal with a SAS interface. As a rule of thumb, you might get as much as 100MB/s per drive, but realistically under normal workload (and across all parts of the platters, not just the fast outer edges), it's safer to assume ~50MB/s. If you implement RAID6 with one parity and one spare, that leaves 10 drives per MSA, or around 500MB/s per. So, it's possible that the storage could provide up to 2GB/s. To me, that seems like a very high number, which would require xfs local file system to see the local bandwidth. And I've not seen a Linux NFS file system generate more than around 6-700MB/s per server, so when we layer that on top, the storage won't be the limit. I think your question was “where will the bottleneck be?”, and I'd speculate that it's the NFS file system. If you are being asked to commit to a specific performance number, then I believe we will need to run actual benchmarks on your configuration because the software environment is subject to a lot of variability.
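The engineer's reasoning can be written down as a "slowest stage wins" calculation. This is only a sketch of the rule-of-thumb numbers in the quote above: 12 drives per tray comes from the 48 drives in 4 trays in the HP quote, and 650 MB/s is simply the midpoint of the "6-700MB/s" NFS figure, not a measurement.

```python
def tray_throughput_mb_s(drives_per_tray: int = 12,
                         non_data_drives: int = 2,   # RAID6 with one parity and one spare, per the quote
                         mb_s_per_drive: int = 50) -> int:
    """Rule-of-thumb streaming throughput of one tray, in MB/s."""
    return (drives_per_tray - non_data_drives) * mb_s_per_drive

def bottleneck(stages: dict[str, float]) -> tuple[str, float]:
    """Return the (name, MB/s) of the slowest stage in the data path."""
    name = min(stages, key=stages.get)
    return name, stages[name]

stages = {
    "disk arrays (4 trays)": 4 * tray_throughput_mb_s(),
    "PCIe x8 bus": 4000,       # "~4GB/s" in the quote
    "InfiniBand": 4000,        # "~4GB/s" in the quote
    "Linux NFS server": 650,   # assumption: midpoint of the quoted range
}

print(bottleneck(stages))
```

With these inputs the NFS layer is the limiting stage, which matches the engineer's speculation; real numbers would need the benchmarks he mentions.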

[Barry] The performance you can expect on IPoIB is between 200mb/s and 300mb/s, based on the information provided in the HP HPC engineering team's string of emails.

Advanced Clustering states:

We have used IPoIB on several occasions with great success. Using IPoIB should take away any network bottleneck and leave the drives as the bottleneck. With NFS over IPoIB you will have a theoretical 10Gb link to your NFS storage which still should leave plenty of bandwidth for your calculations.

But the engineer warns that IPoIB combined with the FSS software might be unstable. Hence, the FSS shelf would be replaced with another expansion shelf, presented as a block device to the head node, which then becomes an NFS server. IPoIB could then be deployed cluster-wide. Not an attractive option.

Attached is the updated quote with 10Gb connectivity on the primary storage node, without IPoIB ($152,283.77, or an additional $1,070 over the quote detailed above). The switch we are using has one 10Gb port. This means traffic between storage node and switch is 10Gb, and between switch and nodes is 1Gb.
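What the 10Gb uplink means for the compute nodes depends on how many of them hit storage at once. A minimal sketch, assuming the 30 nodes of the Round 2 ACT quote, even sharing of the uplink, and no limits from the disks or NFS itself:

```python
def oversubscription_ratio(nodes: int,
                           node_link_gbps: float = 1.0,
                           uplink_gbps: float = 10.0) -> float:
    """Aggregate node bandwidth divided by the storage uplink bandwidth."""
    return nodes * node_link_gbps / uplink_gbps

def per_node_share_gbps(nodes: int,
                        node_link_gbps: float = 1.0,
                        uplink_gbps: float = 10.0) -> float:
    """Bandwidth each node gets if all nodes read from storage simultaneously."""
    if nodes * node_link_gbps <= uplink_gbps:
        return node_link_gbps       # uplink is not saturated
    return uplink_gbps / nodes      # uplink shared evenly

print(oversubscription_ratio(30))   # 30 nodes on 1Gb links behind one 10Gb port
print(per_node_share_gbps(30))
```

Up to 10 nodes can run at their full 1Gb; beyond that the single 10Gb port becomes the shared limit.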

ACT Questions

  1. What size was the cluster (storage, nodes)?
    1. WSU 48 nodes, 384 cores, 5 tb, all infiniband, management over ethernet
    2. UCLA 64 ACT nodes, 512 cores, added to 600 node cluster, all infiniband, 100 tb storage
    3. FSU in top 500 size-wise, public and owner-owned clusters, two ACT clusters of 30 & 10 nodes, enterprise Panasas storage 196 tb
    4. KTZT 9 nodes, 40 tb, in two purchases
  2. What OS, scheduler? Familiarity with that in-house?
    1. WSU Torque,
    2. UCLA CentOS, Sun gridengine (goes commercial via oracle?)
    3. FSU CentOS, Rocks, MOAB
    4. KTZT CentOS, Sun gridengine
  3. Experiences in the setup and final configuration?
    1. WSU ACT sent folks for 1st cluster, second cluster was easy
    2. UCLA very experienced in house and wipe everything delivered
    3. FSU wipes all hardware
    4. KTZT on site visit ($1900), useful, no experiences before hand
  4. Was everything the way it was quoted?
    1. WSU Yes
    2. UCLA Yes
    3. FSU Yes
    4. KTZT yes
  5. What software is primarily used?
    1. WSU Amber, Gaussian
    2. UCLA All kinds, very similar to us, chem/physics/engineering/bio
    3. FSU All kinds
    4. KTZT NOAA, climate modeling
  6. What does the cluster user base look like? Curriculum usage?
    1. WSU Few active accounts (<10), single faculty member + post docs
    2. UCLA 350 users across many dept, primarily grads/faculty, no class usage
    3. FSU hundreds of users
    4. KTZT one faculty, four students
  7. How many administrators are involved? IT staff or students?
    1. WSU 15% of post doc system admin
    2. UCLA many IT persons, estimates that the ACT cluster we have spec'ed requires one experienced person
    3. FSU large IT dept
    4. KTZT faculty
  8. Experiences with their support, hardware and software related?
    1. WSU ACT support has been excellent, prompt replacement parts, solid phone support, considers this the best part of their operation
    2. UCLA IT does not use it, but the depts folks use it, however when involved on those cases ACT support is very knowledgeable and quick, same person on all contacts/tickets (same as platform)
    3. FSU none, except support, rated excellent, very knowledgeable
    4. KTZT Beyond believable, excellent, grand, both hardware and software
  9. Any items you would have done differently with them?
    1. WSU None
    2. UCLA be attentive to quote details, for example infiniband SDR vs DDR
    3. FSU NA
    4. KTZT storage is slow, probably NFS related, get more memory for cpu

HP Questions

UMICH

Other Thoughts

If we purchased just one rack via ACT or Dell (HP unlikely at this time) and reserved $25K for a database server, we could still invest in the old Dell rack. For instance:

Refurbished materials from Dell. But I'm not sure this is worth the effort. Unsupported hardware. The better investment would be to increase the size of the “primary” new cluster, that is go with Advanced Clustering. But worth investigating.

