With the second NSF proposal in a “recommended for funding” state, I'm preparing this page so we can address some looming issues and make decisions on our potential new acquisition. In general, these are the main topics:
The sections below present some detail regarding these points to get the discussion started.
April 27th, 2010 Update: So I had this bright idea, spurred on by the following. The private networks on both clusters have a netmask that yields 255 IPs per subnet. Currently the clusters are chewing up about 175 of those. Now, with a new cluster potentially needing 512 more, we have a netmask and routing problem. Could be very messy.
Light bulb! So if we got a huge cluster, why not make it the primary? For example, we could run CentOS/Kusu/Lava on it or Red Hat/OCS/LSF (costs involved) and adjust the netmasks to something yielding tons of IPs, like 255.255.0.0 … once running, we ingest the petaltail/swallowtail and sharptail clusters into this cluster. Much cleaner and easier to do, and much more flexible. We can then logically segment the physical hardware into “clusters”.
The breaker box panel currently provides 16 L6-30 connectors and can provide 4 more. We need to assess how much of a draw down there currently is, and then size out whether a new subpanel can provide enough connectors for the new cluster.
Cooling requirements of the new cluster will probably be met after the data center's environment work project is finished. We will probably still be in the scenario that if a cooling unit dies, part or all of the cluster would need to be shut down to prevent overheating of the entire data center.
Unfortunately, the netmask of the Dell cluster was set to 255.255.255.0 resulting in 255 IPs in subnet 192.168.1.xxx. After switches, and both clusters, about 75 IPs are available in this range. If the new cluster comes with few multi-cored nodes this may not be a problem. However, if many IPs are needed, the new cluster would reside in another subnet and the routing tables would have to change, if the clusters need to communicate with each other (as in an expansion of the Dell clusters).
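To put numbers on the netmask problem, here is a minimal sketch using Python's standard `ipaddress` module. It compares the current 255.255.255.0 netmask with the 255.255.0.0 netmask floated above. The network addresses and the figure of 175 addresses in use are taken from this page and are approximate. Strictly speaking a 255.255.255.0 subnet has 254 assignable addresses rather than 255, since the network and broadcast addresses are reserved.

```python
import ipaddress


def usable_hosts(network: str, netmask: str) -> int:
    """Number of assignable host addresses in an IPv4 subnet."""
    net = ipaddress.ip_network(f"{network}/{netmask}", strict=False)
    # Exclude the network and broadcast addresses (valid for ordinary subnets).
    return net.num_addresses - 2


current = usable_hosts("192.168.1.0", "255.255.255.0")
wider = usable_hosts("192.168.0.0", "255.255.0.0")
in_use = 175  # approximate figure quoted on this page

print(f"255.255.255.0: {current} usable, about {current - in_use} free")
print(f"255.255.0.0:   {wider} usable")
```

The "about 75 IPs available" estimate above is consistent with this calculation.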
Regarding the existing 0.5 FTE allocation, the new cluster would have a serious impact on this, especially if the new cluster introduces a new operating system (for example x86 Solaris) and a new scheduler (for example Grid Engine or PBS). This will also impact the SCIC in terms of additional documentation and user support.
Home directories are currently served from a NetApp filer via NFS (5 TB). This disk space was bought with the Dell cluster from ITS (sidebar: I think this is correct). Since the filer provides lots of functionality that we do not use, we could entertain the idea of giving this space back. We would then configure the new cluster to house a 25-50 TB disk array serving up the home directories via NFS. This would require a UPS for the disk array, or it could be put on the enterprise UPS. It would also require all nodes on all clusters to communicate with each other, unless we kept the NetApp space and added the new disk array to the new cluster only.
If the new cluster expands the existing Dell cluster we would not need another head/login node, which may allow us to add more compute nodes instead. No more than 75 nodes could be added before we run into routing problems.
All quotes obtained with the proposal contain compute nodes that are all on InfiniBand and ethernet switches. This may be the simplest way to go configuration wise. But the many additional compute nodes with a small memory footprint that we need probably do not require InfiniBand connectivity.
The proposal, if I recall correctly, proposes the addition of a database server. The specifications this database server must meet are unknown. None of the quotes contain a database server. It should probably be a MySQL server to be compatible with any commercial software purchased in the future. It would also require a UPS and the ability of all compute nodes to communicate with the database server. Potentially we could turn host swallowtail into a database server by adding disks and memory.
Another item of concern is that if we move away from RHEL to, for example, Solaris, all the software needs to be recompiled. The new cluster will by far be the most preferred. Recompiling the software base on petaltail was an enormous task. A migration to CentOS is not a huge task, as demonstrated by sharptail.
|quote||$246,237||$202,175||$199,528|
|job slots||512 (reduce to 352 for $200K)||160||272|
|compute nodes||blades||rack server||rack server|
|node count||64 (reduce to 44 for $200K)||20||34|
|chip type||dual-quad E5550||dual-quad E5420||dual-quad AMD 2376|
|chip speed||2.26 ghz||2.5 ghz||2.3 ghz|
|memory||12 gb||8 gb||8 gb|
|hdd||250 gb||2×73 gb||250 gb|
|login node||dual quad core E5520, 12 gb, 2×500 gb hdd||none||dual quad core AMD 2380, 8 gb, 4×146 gb hdd|
|storage||Pinnacle Flexible Storage System, 12x2TB=24TB@7200RPM, disk shelf expandable to 32TB||NX4 NAS dual blade, 12x1TB=24TB@7200RPM||sunfire 4550, 16x4gb=64gb mem, 48x1TB=48TB@7200RPM|
|raid||SATA, 6||SATA, ?||SATA, ZFS|
|storage costs||$10K, $0.41/GB||$52K, $2.12/GB||$55K, $1.12/GB|
|storage functions||multiple NICs, optional 10gb and Infiniband support, LVs > 2 TB, snapshots, quota, antivirus scanning, CIFS/NFS/AFP, multiple raids (incl 5&6), cluster expandable||multiple NICs, optional 10 gb, snapshot(view), optional replication, CIFS/NFS/FTP, multiple raids (incl 5&6), expandable to 60 TB with 4 disk shelves, deduplication, file level compression||multiple NICs, ZFS with RaidZ, unlimited scalability|
|OS||CentOS 5.x||RHEL 5.x||Solaris 10|
|software||Intel C++, Fortran, MPI, MKL|
|scheduler||gridengine||platform LSF (20)||gridengine|
|management||Beo Utils, Ganglia||Ganglia, NTop, Cacti||Xvm|
|support||3 years||3 years||3 years|
|L6-30||13||estimated 4||estimated 8|
|cost to run w/o AC||$22,463 ($0.09/kWh)|
|nr of racks||2x42U||1x42U||1x42U|
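The per-GB figures in the storage costs row can be reproduced, as far as I can tell, by dividing the storage price by the raw capacity with 1 TB counted as 1024 GB. That convention is inferred from the numbers and is not stated in the quotes. A small sketch, using the prices and raw capacities as listed in the table:

```python
def cost_per_gb(cost_usd: float, capacity_tb: float, gb_per_tb: int = 1024) -> float:
    """Storage price divided by raw capacity, in dollars per GB."""
    return cost_usd / (capacity_tb * gb_per_tb)


# (storage price in USD, raw capacity in TB), in table column order.
storage_quotes = [(10_000, 24), (52_000, 24), (55_000, 48)]

for cost, tb in storage_quotes:
    print(f"${cost:,} for {tb} TB raw -> ${cost_per_gb(cost, tb):.2f}/GB")
```

Note that this is raw capacity; usable space after RAID overhead would push the per-GB cost up.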
Instead of adopting the vendor's suggested storage device, we could insert one of the devices listed below. This would require a server in front of the device, which would then provide the file systems to all nodes via NFS. However, the vendors' storage devices are capable of native NFS, so this option essentially adds that server.
All solutions need a UPS or could potentially be powered by the ITS enterprise UPS.
Also, as the size increases, we probably cannot back up to the VTL anymore. In such a case, the snapshot capability is important (point in time restore). We could also design an “archive” present on these devices (raid 1 for example), or perform disk2disk backup copies, still on the same device.
|quote||$48,862||$81,750||$49,892|
|cache||2 gb per controller||2 gb per controller||2 gb per controller|
|size||42x1TB=42TB@7.2KRPM SATA||84x1TB=84TB@7.2KRPM SATA||14x2TB=28TB@7.2KRPM SATA|
|features||multiple volumes, multiple raids, expandable, no snapshot license||dynamic pooling, optional snapshots, expandable||multiple volumes, multiple raids, expandable, no snapshot license|
|form factor||4 U||12 U||4 U|
|support||3 years||3 years||3 years|
If we obtained, for example, the Nexsan 48 TB (raw) block level device, we could entertain some new ideas. Create a 10 TB Lustre file system for fast scratch space. Create a 5-10 TB data archive to avoid duplicating data. Enlarge home directory space to 10-15 TB. Perform D2D backup.
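As a quick sanity check on that carve-up, the sketch below adds the low and high ends of each proposed allocation and shows what would remain for D2D backup. It works on raw capacity only and ignores RAID and file system overhead, which is an assumption made here; the real usable numbers would be lower.

```python
RAW_TB = 48  # raw capacity of the device discussed above

# (low, high) allocation in TB for each proposed use.
plan_tb = {
    "lustre scratch": (10, 10),
    "data archive": (5, 10),
    "home directories": (10, 15),
}

low_total = sum(low for low, _ in plan_tb.values())
high_total = sum(high for _, high in plan_tb.values())

# What is left over for disk-to-disk backup at each end of the range.
backup_min_tb = RAW_TB - high_total
backup_max_tb = RAW_TB - low_total

print(f"allocated: {low_total}-{high_total} TB")
print(f"left for D2D backup: {backup_min_tb}-{backup_max_tb} TB (raw)")
```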
|Item||Advanced Clustering updated||Dell||Dell #2 updated||HP updated!|
|quote||$152,283.77 (+$2K one day, on site)||$149,380.49||$176,856.91 (on site included? check, perhaps +2.5K)||$149,996.49 (one week on site included)|
|overall cost/job slot (minus head node & storage)||$ 550||$ 892||$ 569|| $
|chip type||dual-quad E5620w/12MBCache “Westmere”||dual-quad E5620w/12MBCache “Westmere”||dual-quad E5620w/12MBCache “Westmere”||dual-quad E5620w/12MBCache “Westmere”|
|chip speed||2.40 ghz||2.40 ghz||2.40 ghz||2.40 ghz|
|memory||6×2=12 gb||6×2=12 gb||6×2=12 gb||12×1=12 gb|
|hdd||1×250 gb||1×146 gb (15k SAS)||1×146 gb (15k SAS)||1×160 gb|
|ethernet||2 x Netgear GSM7352 v2 - 48-port, 10Gb switch to storage||two PowerConnect 6248 - 48 port||two PowerConnect 6248 - 48, 10Gb switch to storage||ProCurve Switch 2610-48 ($750), HP ProCurve 2910-48 ($3,500) both 1Gb|
|infiniband|| 72-Port 4X Configuration Modular DDR InfiniBand Switch, all nodes, ||(3?) 12PT 4X DDR-INFINIBAND, all nodes||(3?) 12PT 4X DDR-INFINIBAND, all nodes||Voltaire IB 4X QDR 36P, all nodes, plus head node for IPoIB|
|head node||dual quad core Xeon E5620 2.40GHz w/ 12MB, 6×2=12 gb, 2×500 gb hdd (rack)||XeonE5620 2.4Ghz, 12M Cache, 6×2=12 gb, 2(?)x146 gb hdd (blade)||XeonE5620 2.4Ghz, 12M Cache, 6×2=12 gb, 2(?)x146 gb hdd (blade)||HP E5620 dual-quad DL380G7, 6×2=12 gb, 2x250gb (rack)|
|storage||Pinnacle Flexible Storage System, single Intel Xeon Quad Core X3360 2.83GHz w/ 12MB cache, 8 gb ram||PV MD3200i, 6x PV MD1200, direct attach||NX4 10Gb NAS (EMC2)||HP StorageWorks MSA60 Array, direct attach|
|storage size||24x2TB=48TB @7.2K rpm, expandable to 64TB||7x (12x600gb=7.2TB)=50 TB @15K, expandable to 57.6 TB||2(12×2)=48TB @7.2K rpm, expandable||48x1TB=48TB @7.2K rpm, 4 trays total, not expandable|
|disk,raid||SATA, 6||SAS, 5/6||SATA, 5/6||SATA, 5/6|
|storage costs||$15K, $0.31/GB||$72.5K, $1.42/GB||$52K, $1.06/GB||$45K, $0.92/GB|
|storage functions||multiple NICs, optional 10gb and Infiniband support, LVs > 2 TB, snapshots imaging enabled, quota, antivirus scanning, CIFS/NFS/AFP, multiple raids (incl 5&6), cluster expandable||dual controller, optional snapshots, direct attach iSCSI, head node performs NFS||CIFS/NFS/FTP + iSCSI and Fiber, expandable to 96 TB, snapview licensed, deduplication capable||multiple NICs (on head node), direct attach, head node performs NFS duties via IPoIB, not expandable, no snapshotting|
|storage mgt software||yes||?||yes||yes|
|OS||CentOS 5.x||RHEL 5.3||RHEL 5.3||RHEL5.3|
|software||Intel C++, Fortran, MPI, MKL||none||dell||none|
|scheduler||gridengine||Platform LSF||Platform LSF||gridengine|
|management||Breakin, Cloner, Beo Utils, Act Dir, Ganglia||Platform OCS5.x||Platform OCS5.x||HP Cluster Mgt Utility Lic and Media|
|UPS||3000VA UPS||5600W, 4U, 208V||5600W, 4U, 208V||none|
|support||3 years, NBD||3 years, 4-Hour 7×24 On-site Service||3 years, NBD||3 years, NBD|
|cost to run (9.36c/kWh) w/o A/C||$10,897.28||$11,797.12||$8,952.57 (Watts+AC: 24% greener than Dell - saves $5,700/year, 18% greener than ACT - saves $4,000/year)|
|Us Used per rack||40/42U||??/48||33/42U|
|Note1||unlimited lifetime technical support for all hardware and software we supply||SAS drives 600 gb @15K is $750 vs 500 gb @7.2K SATA is $275, estimate is that could save $40K||NX4 has more protocols than we need but so what, $1/GB is good||arrives fully integrated, with knowledge transfer/training for a week on all parts of the cluster|
|Note2||this solution could do IPoIB, presumably yielding an NFS boost in performance, however desire native NFS, see next section||Or it could be that 15K SAS drives in array is a good idea?||UPS and iKVM are gone, no big deal||IPoIB the unknown, we could do NFS on head node, but then a bottleneck?|
|Note3||all MPI flavors precompiled with GNU, Portland (for AMD), Intel (for Xeon)||Platform OCS and LSF add costs, support is nice though||Total cost for OCS/LSF/RHEL is about $20K, that sets up a differential with ACT of $4.5K now||
|Note4||$75K buys another 135 job slots, 17 nodes||$75K buys another 80 job slots, 10 nodes||there is a second NX4 expansion shelf quoted which is not needed, could reduce quote by $5K||$75k buys another 180 job slots, 22 nodes|
|Note5||with these large hard disks on compute nodes, investigate whether /localscratch partitions can be presented as one scratch file system via Lustre, i.e. 200×30 = 6TB, definitely worth the effort|
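The Note4 estimates of what an extra $75K buys can be roughly cross-checked against the cost per job slot row. The sketch below assumes 8 job slots per node (one per core on a dual quad-core node) and that only whole nodes can be bought; both are assumptions made here, not figures from the quotes. It covers only the two columns for which both a cost per job slot and a Note4 estimate are listed.

```python
def extra_capacity(budget_usd: int, cost_per_slot_usd: int, slots_per_node: int = 8):
    """Whole nodes, and the job slots they carry, that a budget can buy."""
    cost_per_node = cost_per_slot_usd * slots_per_node
    nodes = budget_usd // cost_per_node
    return nodes, nodes * slots_per_node


BUDGET = 75_000
for vendor, cost_per_slot in [("Advanced Clustering", 550), ("Dell", 892)]:
    nodes, slots = extra_capacity(BUDGET, cost_per_slot)
    print(f"{vendor}: {nodes} nodes, {slots} job slots")
```

This gives 17 nodes for Advanced Clustering and 10 nodes for Dell, in line with Note4, with the job slot counts landing within one slot of the table.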
|Type||Bandwidth (Gbps)||~Latency (us)||Price NIC+Port ($)|
|10 Gigabit Ethernet||10||40||1,350|
|Infiniband 4x SDR||8||4||600|
|Infiniband 8x SDR||16||4||720|
|Infiniband 16x SDR||32||4||1,200|
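One way to read the interconnect table is price per unit of bandwidth. The sketch below simply divides the NIC+port price by the listed bandwidth for each row. It ignores latency, cabling and switch chassis costs, so treat it as a rough comparison only.

```python
# name: (bandwidth in Gbps, approx latency in microseconds, NIC+port price in USD)
interconnects = {
    "10 Gigabit Ethernet": (10, 40, 1350),
    "Infiniband 4x SDR": (8, 4, 600),
    "Infiniband 8x SDR": (16, 4, 720),
    "Infiniband 16x SDR": (32, 4, 1200),
}


def price_per_gbps(name: str) -> float:
    """NIC+port price divided by nominal bandwidth."""
    gbps, _latency_us, price = interconnects[name]
    return price / gbps


for name in interconnects:
    print(f"{name}: ${price_per_gbps(name):.2f} per Gbps")
```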
[HP Engineer] That DL380 has a P812 SAS controller, which has support for 6Gb SAS, and four 4x connectors to the MDS60 JBODs, and the RAID controller has eight 6Gb/s SAS physical links. That's a heavy-weight controller (with a 1GB cache), and the 8x PCIe bus (~4GB/s) and InfiniBand (~4GB/s) will be able to keep up with whatever the storage can provide. While the drives are listed as SAS, at that density, they are SATA internal with a SAS interface. As a rule of thumb, you might get as much as 100MB/s per drive, but realistically under normal workload (and across all parts of the platters, not just the fast outer edges), it's safer to assume ~50MB/s. If you implement RAID6 with one parity and one spare, that leaves 10 drives per MSA, or around 500MB/s per. So, it's possible that the storage could provide up to 2GB/s. To me, that seems like a very high number, which would require xfs local file system to see the local bandwidth. And I've not seen a Linux NFS file system generate more than around 6-700MB/s per server, so when we layer that on top, the storage won't be the limit. I think your question was “where will the bottleneck be?”, and I'd speculate that it's the NFS file system. If you are being asked to commit to a specific performance number, then I believe we will need to run actual benchmarks on your configuration because the software environment is subject to a lot of variability.
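The engineer's reasoning can be written out as a small back-of-the-envelope calculation. The per-drive rate, drives per tray and bus speeds are the rules of thumb quoted above; the 4 trays come from the HP storage size row; and 650 MB/s is simply picked here as the midpoint of the "6-700MB/s" NFS figure. None of these are measured values.

```python
PER_DRIVE_MB_S = 50        # conservative sustained rate per drive (rule of thumb)
DATA_DRIVES_PER_TRAY = 10  # 12 drives per tray minus parity and spare, as quoted
TRAYS = 4                  # from the HP storage size row
NFS_SERVER_MB_S = 650      # assumed midpoint of the quoted 600-700 MB/s

per_tray_mb_s = PER_DRIVE_MB_S * DATA_DRIVES_PER_TRAY

# Throughput ceiling of each stage between the disks and the compute nodes.
limits_mb_s = {
    "disk arrays": per_tray_mb_s * TRAYS,
    "PCIe x8 bus": 4000,
    "InfiniBand": 4000,
    "NFS server": NFS_SERVER_MB_S,
}

# The slowest stage bounds the end-to-end throughput.
bottleneck = min(limits_mb_s, key=limits_mb_s.get)
print(f"bottleneck: {bottleneck} at ~{limits_mb_s[bottleneck]} MB/s")
```

Under these assumptions the NFS layer, not the disks, is the limiting stage, which matches the engineer's speculation.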
[Barry] The performance you can expect on IPoIB is between 200mb/s and 300mb/s, based on the information provided in the HP HPC engineering team's string of emails.
Advanced Clustering states:
We have used IPoIB on several occasions with great success. Using IPoIB should take away any network bottleneck and leave the drives as the bottleneck. With NFS over IPoIB you will have a theoretical 10Gb link to your NFS storage which still should leave plenty of bandwidth for your calculations.
But the engineer warns that IPoIB and the FSS software might be unstable. Hence, the FSS shelf would be replaced with another expansion shelf, presented as a block device to the head node, which then becomes an NFS server. Then IPoIB could be deployed cluster wide. Not an attractive option.
Attached is the updated quote with 10Gb connectivity on the primary storage node, without IPoIB. The switch we are using has one 10Gb port ($152,283.77, or an additional $1,070 to the quote detailed above). This means traffic between storage node and switch is 10Gb and between switch and nodes is 1Gb.
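To get a feel for what the 10Gb storage link means in practice, the sketch below estimates the bandwidth each compute node would see when several nodes read from the storage node at once. It assumes ideal fair sharing of the uplink and ignores protocol overhead and NFS server limits, so real numbers would be lower.

```python
def per_node_gbps(active_nodes: int,
                  uplink_gbps: float = 10.0,
                  node_link_gbps: float = 1.0) -> float:
    """Ideal bandwidth per node when active_nodes share the storage uplink."""
    if active_nodes < 1:
        raise ValueError("active_nodes must be at least 1")
    # Each node is capped by its own 1Gb link and by its share of the uplink.
    return min(node_link_gbps, uplink_gbps / active_nodes)


for n in (5, 10, 20, 40):
    print(f"{n} active nodes -> {per_node_gbps(n):.2f} Gb/s each")
```

In this idealized model the 10Gb link is saturated once more than 10 nodes hit the storage node at full rate.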
If we purchased just one rack via ACT or Dell (HP unlikely at this time), and we reserved $25K for a database server, we could still invest in the old Dell rack. For instance:
Refurbished materials from Dell. But I'm not sure this is worth the effort. Unsupported hardware. The better investment would be to increase the size of the “primary” new cluster, that is go with Advanced Clustering. But worth investigating.