
Overview

With the second NSF proposal in a “recommended for funding” state, I'm preparing this page so we can address some looming issues and make decisions on our potential new acquisition. In general, these are the main topics:

Data Center/ITS Items
- Increased cooling requirements
- More electric L6-30 connectors
- 192.168.1.xxx subnet
- FTE requirements

Cluster/Committee Items
- Increase Home Directory Storage
- More infiniband nodes
- More nodes with small memory footprint
- Database server

New ventures
- Should we investigate Platform ISF, a private cloud for HPC? It can run both Windows and Linux compute nodes in adaptive mode (meaning in a virtualized environment).

The sections below present some detail on these points to get the discussion started.

Tada Moment!

April 27th, 2010 Update: So I had this bright idea, spurred on by the following. The private networks on both clusters have a netmask that yields 254 usable IPs per subnet. Currently the clusters are chewing up about 175 of those. Now, with a new cluster potentially needing 512 more, we have a netmask and routing problem. It could get very messy.

Light bulb! If we get a huge cluster, why not make it the primary? For example, we could run CentOS/Kusu/Lava on it, or Red Hat/OCS/LSF (costs involved), and adjust the netmask to something yielding tons of IPs, like 255.255.0.0. Once it is running, we ingest the petaltail/swallowtail and sharptail clusters into this cluster. Much cleaner and easier to do, and much more flexible. We can then logically segment the physical hardware into “clusters”.
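To put numbers on the netmask idea, here is a minimal sketch using Python's standard `ipaddress` module. The helper name `usable_hosts` and the 192.168.0.0 network address for the wider mask are my assumptions for illustration; the 175 in-use and 512 needed figures are the estimates from the text above.

```python
import ipaddress

def usable_hosts(network: str) -> int:
    """Return the number of assignable host addresses in an IPv4 network."""
    net = ipaddress.ip_network(network)
    # Subtract the network address and the broadcast address.
    return net.num_addresses - 2

current = usable_hosts("192.168.1.0/255.255.255.0")
proposed = usable_hosts("192.168.0.0/255.255.0.0")  # assumed network address

in_use = 175      # rough estimate of IPs used by the existing clusters
new_nodes = 512   # potential need of a new cluster

for label, hosts in (("current", current), ("proposed", proposed)):
    fits = hosts - in_use >= new_nodes
    print(f"{label}: {hosts} usable hosts, room for new cluster: {fits}")
```

With the current mask the new cluster does not fit; with a 255.255.0.0 mask there is room to spare, which is the whole argument for making the new cluster the primary.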

Data Center

The breaker box panel currently provides 16 L6-30 connectors and can provide 4 more. We need to assess how much power is currently being drawn, and then size out whether a new subpanel can provide enough connectors for the new cluster.

The cooling requirements of the new cluster will probably be met once the data center's environment work project is finished. We will probably still be in the scenario where, if a cooling unit dies, part or all of the cluster needs to be shut down to prevent the entire data center from overheating.

Unfortunately, the netmask of the Dell cluster was set to 255.255.255.0, resulting in 254 usable IPs in subnet 192.168.1.xxx. After the switches and both clusters, about 75 IPs are available in this range. If the new cluster comes with a few multi-core nodes, this may not be a problem. However, if many IPs are needed, the new cluster would reside in another subnet, and the routing tables would have to change if the clusters need to communicate with each other (as in an expansion of the Dell clusters).

The new cluster would have a serious impact on the existing 0.5 FTE allocation, especially if it introduces a new operating system (for example x86 Solaris) and a new scheduler (for example Grid Engine or PBS). This will also impact the SCIC, which would need to provide additional documentation and user support.

Cluster/Committee Items

Home directories are currently served from a NetApp filer via NFS (5 TB). This disk space was bought from ITS with the Dell cluster (sidebar: I think this is correct). Since the filer provides lots of functionality that we do not use, we could entertain the idea of giving this space back. We would then configure the new cluster to house a 25-50 TB disk array serving up the home directories via NFS. This would require a UPS for the disk array, or putting it on the Enterprise UPS. It would also require all nodes on all clusters to communicate with each other, unless we kept the NetApp space and added the new disk array to the new cluster only.

If the new cluster expands the existing Dell cluster, we would not need another head/login node. That may allow for more compute nodes. No more than 75 nodes could be added before we run into routing problems.

All quotes obtained with the proposal contain compute nodes that are all on infiniband and ethernet switches. This may be the simplest way to go configuration-wise. But the many additional compute nodes with a small memory footprint that we need probably do not require infiniband connectivity.

The proposal, if I recall correctly, proposes the addition of a database server. The specifications this database server must meet are unknown. None of the quotes contain a database server. It should probably be a MySQL server to be compatible with any commercial software purchased in the future. It would also require a UPS and the ability of all compute nodes to communicate with the database server. Potentially we could turn host swallowtail into a database server by adding disks and memory.

Another item of concern is that if we move away from RHEL to, for example, Solaris, all the software needs to be recompiled. The new cluster will by far be the most preferred. Recompiling the software base on petaltail was an enormous task. A migration to CentOS is not a huge task, as demonstrated by sharptail.

Comparison of Quotes

Item	Advanced Clustering	Dell	Sun
job slots	512 (reduce to 352 for $200K)	160	272
compute nodes	blades	rack server	rack server
node count	64 (reduce to 44 for $200K)	20	34
chip type	dual-quad E5550	dual-quad E5420	dual-quad AMD 2376
chip speed	2.26 ghz	2.5 ghz	2.3 ghz
memory	12 gb	8 gb	8 gb
hdd	250 gb	2×73 gb	250 gb
infiniband	all	all	all
login node	dual quad core E5520, 12 gb, 2×500 gb hdd	none	dual quad core AMD 2380, 8 gb, 4×146 gb hdd
storage	Pinnacle Flexible Storage System, 12x2TB=24TB@7.2KRPM, disk shelf expandable to 32TB	NX4 NAS dual blade, 12x1TB=24TB@7.2KRPM	sunfire 4550, 16x4gb=64gb mem, 48x1TB=48TB@7.2KRPM
raid	SATA, 6	SATA, ?	SATA, ZFS
storage costs	$10K, $0.41/GB	$52K, $2.12/GB	$55K, $1.12/GB
storage functions	multiple NICs, optional 10gb and Infiniband support, LVs > 2 TB, snapshots, quota, antivirus scanning, CIFS/NFS/AFP, multiple raids (incl 5&6), cluster expandable	multiple NICs, optional 10 gb, snapshot(view), optional replication, CIFS/NFS/FTP, multiple raids (incl 5&6), expandable to 60 TB with 4 disk shelves, deduplication, file level compression	multiple NICs, ZFS with RaidZ, unlimited scalability
OS	CentOS 5.x	RHEL 5.x	Solaris 10
software	Intel C++, Fortran, MPI, MKL
scheduler	gridengine	platform LSF (20)	gridengine
management	Beo Utils, Ganglia	Ganglia, NTop, Cacti	Xvm
UPS	3000VA, 2U
iKVM	yes
support	3 years	3 years	3 years
L6-30	13	estimated 4	estimated 8
cooling	7.75 tons
weight	3,000 lbs
cost to run w/o AC	$22,463 ($0.09/kWh)
nr of racks	2x42U	1x42U	1x42U
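The storage cost per GB figures in the table can be reproduced with a few lines of Python. This is only a sketch: the function name `cost_per_gb` is mine, and the 1 TB = 1024 GB conversion is an assumption that I inferred because it matches the quoted $/GB values.

```python
def cost_per_gb(cost_dollars: float, capacity_tb: float) -> float:
    """Storage cost in dollars per GB, assuming 1 TB = 1024 GB."""
    return cost_dollars / (capacity_tb * 1024)

# (storage cost in dollars, raw capacity in TB) taken from the table above
quotes = {
    "Advanced Clustering": (10_000, 24),
    "Dell": (52_000, 24),
    "Sun": (55_000, 48),
}

for vendor, (cost, tb) in quotes.items():
    print(f"{vendor}: ${cost_per_gb(cost, tb):.2f}/GB")
```

Note these are raw capacities; usable space after RAID overhead would push every number up somewhat.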

Alternate Storage Options

Instead of adopting the vendor's suggested storage device, we could insert one of the devices listed below. This would require a server in front of the device, which would then provide the file systems to all nodes via NFS. The vendors' own storage devices are capable of native NFS, so with these alternatives we would essentially be adding that server.

All solutions need a UPS, or could potentially be powered by the ITS enterprise UPS.

Also, as the size increases, we probably cannot back up to the VTL anymore. In that case, the snapshot capability is important (point-in-time restore). We could also design an “archive” on these devices (raid 1 for example), or perform disk2disk backup copies, still on the same device.

Item	Nexsan	RAID Inc	Nexsan
name	SataBeast	Xanadu	SataBeast
controllers	dual	dual	dual
cache	2 gb per controller	2 gb per controller	2 gb per controller
size	42x1TB=42TB@7.2KRPM SATA	84x1TB=84TB@7.2KRPM SATA	14x2TB=28TB@7.2KRPM SATA
cost/GB	$1.14/GB	$0.96/GB	$1.74/GB
protocols	iSCSI/FC	iSCSI/FC	iSCSI/FC
raid 1,5&6	yes	yes	yes
features	multiple volumes, multiple raids, expandable, no snapshot license	dynamic pooling, optional snapshots, expandable	multiple volumes, multiple raids, expandable, no snapshot license
front-end required?	yes	yes	yes
form factor	4 U	12 U	4 U
support	3 years	3 years	3 years

If we obtained, for example, the Nexsan 48 TB (raw) block level device, we could entertain some new ideas. Create a 10 TB Lustre file system for fast scratch space. Create a 5-10 TB data archive to avoid duplicating data. Enlarge home directory space to 10-15 TB. Perform D2D backup.

Round 2 of Quotes

Item	Advanced Clustering updated	Dell	Dell #2 updated	HP updated!

job slots	240	80	240	256
overall cost/job slot (minus head node & storage)	$ 550	$ 892	$ 569	$ ~~415~~
compute nodes	blades	blades	blades	blades
node count	30	10	30	32
chip type	dual-quad E5620w/12MBCache “Westmere”	dual-quad E5620w/12MBCache “Westmere”	dual-quad E5620w/12MBCache “Westmere”	dual-quad E5620w/12MBCache “Westmere”
chip speed	2.40 ghz	2.40 ghz	2.40 ghz	2.40 ghz
memory	6×2=12 gb	6×2=12 gb	6×2=12 gb	12×1=12 gb
hdd	1×250 gb	1×146 gb (15k SAS)	1×146 gb (15k SAS)	1×160 gb
ethernet	2 x Netgear GSM7352 v2 - 48-port, 10Gb switch to storage	two PowerConnect 6248 - 48 port	two PowerConnect 6248 - 48, 10Gb switch to storage	ProCurve Switch 2610-48 ($750), HP ProCurve 2910-48 ($3,500) both 1Gb
infiniband	72-Port 4X Configuration Modular DDR InfiniBand Switch, all nodes, ~~includes head node and storage devices for IPoIB~~ (see next section)	(3?) 12PT 4X DDR-INFINIBAND, all nodes	(3?) 12PT 4X DDR-INFINIBAND, all nodes	Voltaire IB 4X QDR 36P, all nodes, plus head node for IPoIB
head node	dual quad core Xeon E5620 2.40GHz w/ 12MB, 6×2=12 gb, 2×500 gb hdd (rack)	XeonE5620 2.4Ghz, 12M Cache, 6×2=12 gb, 2(?)x146 gb hdd (blade)	XeonE5620 2.4Ghz, 12M Cache, 6×2=12 gb, 2(?)x146 gb hdd (blade)	HP E5620 dual-quad DL380G7, 6×2=12 gb, 2x250gb (rack)
storage	Pinnacle Flexible Storage System, single Intel Xeon Quad Core X3360 2.83GHz w/ 12MB cache, 8 gb ram	PV MD3200i, 6x PV MD1200, direct attach	NX4 10Gb NAS (EMC2)	HP StorageWorks MSA60 Array, direct attach
storage size	24x2TB=48TB @7.2K rpm, expandable to 64TB	7x (12x600gb=7.2TB)=50 TB @15K, expandable to 57.6 TB	2(12×2)=48TB @7.2K rpm, expandable	48x1TB=48TB @7.2K rpm, 4 trays total, not expandable
disk,raid	SATA, 6	SAS, 5/6	SATA, 5/6	SATA, 5/6
storage costs	$15K, $0.31/GB	$72.5K, $1.42/GB	$52K, $1.06/GB	$45K, $0.92/GB
storage functions	multiple NICs, optional 10gb and Infiniband support, LVs > 2 TB, snapshots imaging enabled, quota, antivirus scanning, CIFS/NFS/AFP, multiple raids (incl 5&6), cluster expandable	dual controller, optional snapshots, direct attach iSCSI, head node performs NFS	CIFS/NFS/FTP + iSCSI and Fiber, expandable to 96 TB, snapview licensed, deduplication capable	multiple NICs (on head node), direct attach, head node performs NFS duties via IPoIB, not expandable, no snapshotting
storage mgt software	yes	?	yes	yes
OS	CentOS 5.x	RHEL 5.3	RHEL 5.3	RHEL5.3
software	Intel C++, Fortran, MPI, MKL	none	dell	none
scheduler	gridengine	Platform LSF	Platform LSF	gridengine
management	Breakin, Cloner, Beo Utils, Act Dir, Ganglia	Platform OCS5.x	Platform OCS5.x	HP Cluster Mgt Utility Lic and Media
UPS	3000VA UPS	5600W, 4U, 208V	5600W, 4U, 208V	none
iKVM	yes	yes	no	yes
support	3 years, NBD	3 years, 4-Hour 7×24 On-site Service	3 years, NBD	3 years, NBD
L6-30	3		2(?)	4
Watts	12,891		13,943	10,602
BTU/hr	43,984		47,616	36,175
A/C Tons	3.67		4	3
weight lbs	1829.9			1,006
cost to run (9.36c/kWh) w/o A/C	$10,897.28		$11,797.12	$8,952.57 (Watts+AC: 24% greener than Dell - saves $5,700/year, 18% greener than ACT - saves $4,000/year)
Us Used per rack	40/42U	??/48		33/42U
Note1	unlimited lifetime technical support for all hardware and software we supply	SAS drives 600 gb @15K is $750 vs 500 gb @7.2K SATA is $275, estimate is that could save $40K	NX4 has more protocols than we need but so what, $1/GB is good	arrives fully integrated, with knowledge transfer/training for a week on all parts of the cluster
Note2	this solution could do IPoIB, presumably yielding an NFS boost in performance, however desire native NFS, see next section	Or it could be that 15K SAS drives in array is a good idea?	UPS and iKVM are gone, no big deal	IPoIB the unknown, we could do NFS on head node, but then a bottleneck?
Note3	all MPI flavors precompiled with Gnu, Portland (for AMD), Intel (for Xeon)	Platform OCS and LSF add costs, support is nice though	Total cost for OCS/LSF/RHEL is about $20K, that sets up a differential with ACT of $4.5K now	~~Unsure about the lack of management/OS software preinstalled/preconfigured, what if we hit a driver problem with HP hardware?~~ unit will be fully integrated to our specs with whatever we want (CentOS or RHEL/Gridengine)
Note4	$75K buys another 135 job slots, 17 nodes	$75K buys another 80 job slots, 10 nodes	there is a second NX4 expansion shelf quoted which is not needed, could reduce quote by $5K	$75k buys another 180 job slots, 22 nodes
Note5	with these large hard disks on compute nodes, investigate whether /localscratch partitions can be presented as one scratch file system via Lustre, i.e. 200 GB × 30 = 6 TB, definitely worth the effort
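The Watts, BTU/hr and A/C Tons rows above are related by two standard conversions (1 W is about 3.412 BTU/hr, and 1 ton of cooling is 12,000 BTU/hr). A minimal sketch, with a function name of my own choosing; the results land close to the quoted rows but may differ slightly, since I do not know the exact factors the vendors used.

```python
BTU_HR_PER_WATT = 3.412   # 1 W is about 3.412 BTU/hr
BTU_HR_PER_TON = 12_000   # 1 ton of cooling = 12,000 BTU/hr

def cooling_load(watts):
    """Convert electrical load in watts to (BTU/hr, tons of cooling)."""
    btu_hr = watts * BTU_HR_PER_WATT
    return btu_hr, btu_hr / BTU_HR_PER_TON

# Wattages from the Round 2 table
for vendor, watts in (("Advanced Clustering", 12_891),
                      ("Dell #2", 13_943),
                      ("HP", 10_602)):
    btu_hr, tons = cooling_load(watts)
    print(f"{vendor}: {btu_hr:,.0f} BTU/hr, {tons:.2f} tons")
```

This is handy for checking any revised quote against what the data center cooling units can absorb.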

IPoIB

Type	Bandwidth (Gbps)	~Latency (us)	Price NIC+Port ($)
Gigabit Ethernet	1	40	40
10 Gigabit Ethernet	10	40	1,350
Infiniband 4x SDR	8	4	600
Infiniband 8x SDR	16	4	720
Infiniband 16x SDR	32	4	1,200

For point-to-point throughput, IP over InfiniBand (Connected Mode) is comparable to native InfiniBand.
When a disk is a bottleneck, NFS cannot benefit from IPoIB.
When a disk is not a bottleneck, NFS benefits significantly from both IPoIB
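One way to read the interconnect table is price per unit of bandwidth. The sketch below simply divides the last column by the bandwidth column; the function name `price_per_gbps` is mine, and the comparison ignores latency, which the table shows is about ten times better on Infiniband.

```python
# (type, bandwidth in Gbps, price of NIC + port in dollars) from the table above
links = [
    ("Gigabit Ethernet", 1, 40),
    ("10 Gigabit Ethernet", 10, 1350),
    ("Infiniband 4x SDR", 8, 600),
    ("Infiniband 8x SDR", 16, 720),
    ("Infiniband 16x SDR", 32, 1200),
]

def price_per_gbps(bandwidth_gbps, price_dollars):
    """Dollars paid per Gbps of nominal bandwidth."""
    return price_dollars / bandwidth_gbps

# List the cheapest bandwidth first.
for name, bw, price in sorted(links, key=lambda link: price_per_gbps(link[1], link[2])):
    print(f"{name}: ${price_per_gbps(bw, price):.2f} per Gbps")
```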

HP states

[HP Engineer] That DL380 has a P812 SAS controller, which has support for 6Gb SAS, and four 4x connectors to the MSA60 JBODs, and the RAID controller has eight 6Gb/s SAS physical links. That's a heavy-weight controller (with a 1GB cache), and the 8x PCIe bus (~4GB/s) and InfiniBand (~4GB/s) will be able to keep up with whatever the storage can provide. While the drives are listed as SAS, at that density, they are SATA internal with a SAS interface. As a rule of thumb, you might get as much as 100MB/s per drive, but realistically under normal workload (and across all parts of the platters, not just the fast outer edges), it's safer to assume ~50MB/s. If you implement RAID6 with one parity and one spare, that leaves 10 drives per MSA, or around 500MB/s per. So, it's possible that the storage could provide up to 2GB/s. To me, that seems like a very high number, which would require xfs local file system to see the local bandwidth. And I've not seen a Linux NFS file system generate more than around 6-700MB/s per server, so when we layer that on top, the storage won't be the limit. I think your question was “where will the bottleneck be?”, and I'd speculate that it's the NFS file system. If you are being asked to commit to a specific performance number, then I believe we will need to run actual benchmarks on your configuration because the software environment is subject to a lot of variability.
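The engineer's reasoning boils down to taking the minimum over each stage of the data path. Here is a rough sketch of that arithmetic using only the numbers quoted above; the names are mine, the 4 shelves come from the HP quote ("4 trays total"), 1 GB/s is treated as 1000 MB/s, and 650 MB/s is just the midpoint of the quoted 6-700MB/s NFS range.

```python
def array_bandwidth_mb_s(shelves, data_drives_per_shelf, mb_s_per_drive):
    """Aggregate streaming bandwidth of the disk shelves in MB/s."""
    return shelves * data_drives_per_shelf * mb_s_per_drive

# Estimated throughput of each stage in MB/s
stages = {
    "disk arrays": array_bandwidth_mb_s(4, 10, 50),  # 10 data drives at ~50 MB/s each
    "PCIe x8 bus": 4000,
    "InfiniBand": 4000,
    "Linux NFS server": 650,  # assumed midpoint of 600-700 MB/s
}

# The slowest stage limits end-to-end throughput.
bottleneck = min(stages, key=stages.get)
print(f"bottleneck: {bottleneck} at ~{stages[bottleneck]} MB/s")
```

As the engineer says, only real benchmarks on our configuration would confirm any of these numbers.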

[Barry] The performance you can expect on IPoIB is between 200mb/s and 300mb/s, based on the information provided in the HP HPC engineering team's string of emails.

Advanced Clustering states:

We have used IPoIB on several occasions with great success. Using IPoIB should take away any network bottleneck and leave the drives as the bottleneck. With NFS over IPoIB you will have a theoretical 10Gb link to your NFS storage which still should leave plenty of bandwidth for your calculations.

But the engineer warns that IPoIB and the FSS software together might be unstable. Hence, the FSS shelf would be replaced with another expansion shelf, presented as a block device to the head node, which then becomes an NFS server. Then IPoIB could be deployed cluster wide. Not an attractive option.

Attached is the updated quote with 10Gb connectivity on the primary storage node, without IPoIB. The switch we are using has one 10Gb port ($152,283.77, or an additional $1,070 over the quote detailed above). This means traffic between storage node and switch is 10Gb and between switch and nodes is 1Gb.
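To get a feel for what a single 10Gb uplink means, here is a back-of-the-envelope sketch of the fair-share bandwidth per node when many nodes hit the storage at once. The function name is mine, the 30-node count comes from the ACT quote, and the model is an assumption that ignores protocol overhead and any disk limits.

```python
def per_node_gbps(uplink_gbps, node_link_gbps, active_nodes):
    """Fair-share bandwidth per node when active_nodes read from storage at once."""
    # A node can never exceed its own link speed, and the uplink is shared evenly.
    return min(node_link_gbps, uplink_gbps / active_nodes)

for nodes in (5, 10, 30):
    share = per_node_gbps(uplink_gbps=10, node_link_gbps=1, active_nodes=nodes)
    print(f"{nodes} active nodes: {share:.2f} Gb/s each")
```

Up to 10 busy nodes each get their full 1Gb link; beyond that the uplink to the storage node is what gets shared.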

ACT Questions

What size was the cluster (storage, nodes)?
1. WSU 48 nodes, 384 cores, 5 tb, all infiniband, management over ethernet
2. UCLA 64 ACT nodes, 512 cores, added to 600 node cluster, all infiniband, 100 tb storage
3. FSU in top 500 size wise, public and owner owned clusters, two ACT clusters 30 & 10 nodes, enterprise panasas storage 196 tb
4. KTZT 9 nodes, 40 tb, in two purchases
What OS, scheduler? Familiarity with that in-house?
1. WSU Torque
2. UCLA CentOS, Sun gridengine (goes commercial via oracle?)
3. FSU CentOS, Rocks, MOAB
4. KTZT CentOS, Sun gridengine
Experiences in the setup and final configuration?
1. WSU ACT sent folks for 1st cluster, second cluster was easy
2. UCLA very experienced in house and wipes everything delivered
3. FSU wipes all hardware
4. KTZT on site visit ($1900), useful, no experiences before hand
Was everything the way it was quoted?
1. WSU Yes
2. UCLA Yes
3. FSU Yes
4. KTZT yes
What software is primarily used?
1. WSU Amber, Gaussian
2. UCLA All kinds, very similar to us, chem/physics/engineering/bio
3. FSU All kinds
4. KTZT NOAA, climate modeling
What does the cluster user base look like? Curriculum usage?
1. WSU Few active accounts (<10), single faculty member + post docs
2. UCLA 350 users across many dept, primarily grads/faculty, no class usage
3. FSU hundreds of users
4. KTZT one faculty, four students
How many administrators are involved? IT staff or students?
1. WSU 15% of post doc system admin
2. UCLA many IT persons, estimates that the ACT cluster we have spec'ed requires one experienced person
3. FSU large IT dept
4. KTZT faculty
Experiences with their support, hardware and software related?
1. WSU ACT support has been excellent, prompt replacement parts, solid phone support, considers this the best part of their operation
2. UCLA IT does not use it, but the depts folks use it, however when involved on those cases ACT support is very knowledgeable and quick, same person on all contacts/tickets (same as platform)
3. FSU none, except support, rated excellent, very knowledgeable
4. KTZT Beyond believable, excellent, grand, both hardware and software
Any items you would have done differently with them?
1. WSU None
2. UCLA be attentive to quote details, for example infiniband SDR vs DDR
3. FSU NA
4. KTZT storage is slow, probably NFS related, get more memory for cpu

HP Questions

UMICH

Dell complacent, HP proactive (buys $1M in 3 years, 2852 cores)
No experience with software
Good HP support experience
HP more efficient than Dell, supports that
Torque/PBS shop
No IPoIB experience (yet)
Also a Dell shop

Other Thoughts

If we purchased just one rack via ACT or Dell (HP unlikely at this time), and we reserved $25K for a database server, we could still invest in the old Dell rack. For instance:

collapse ehw and ehwfd
- buy 2 more md1000 with even larger 15k disks
- makes a single queue of 8 nodes
collapse elw, emw
- universal 16 gb memory footprint per node
- makes a single queue of 12 nodes
imw
- double memory to 16 gb/node

This would mean refurbished materials from Dell, and unsupported hardware, so I'm not sure this is worth the effort. The better investment would be to increase the size of the “primary” new cluster, that is, go with Advanced Clustering. But worth investigating.
