HP HPC
Notes for the cluster design conference with HP.
“do later” means we tackle it after the HP on-site visit.
S & H
Shipping Address: 5th floor data center
No 13'6" trucks; a 12'6" truck or a box truck is OK
Delivery is to a standard raised dock; there is no way to lift the rack out of the truck if it cannot dock
Freight elevator and pallet jack are available
Network
Basically, the addressing scheme is:
x.y.z.255 is the broadcast address
x.y.z.254 is the head (login) node
x.y.z.0 is the gateway
x.y.z.1 through x.y.z.9 is for all switches (prefer .1) and console/management ports
x.y.z.10 through x.y.z.253 is for all compute nodes
We plan to fold our Dell cluster (37 nodes) and our Blue Sky Studios cluster (130 nodes) into this setup, hence the approach.
The netmask, finally, is 255.255.0.0 (excluding the public 129.133 subnet).
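For reference, a minimal Python sketch of the convention above; 192.168.102 is purely an example prefix, and the role assignments come straight from the list, nothing more is implied.

# Spell out the last-octet roles described above for one example x.y.z prefix.
def address_plan(prefix: str) -> dict[str, str]:
    plan = {
        f"{prefix}.255": "broadcast",
        f"{prefix}.254": "head/login node",
        f"{prefix}.0": "gateway",
    }
    for last in range(1, 10):        # .1-.9: switches (prefer .1), console/mgmt ports
        plan[f"{prefix}.{last}"] = "switch or console/management port"
    for last in range(10, 254):      # .10-.253: compute nodes
        plan[f"{prefix}.{last}"] = "compute node"
    return plan

if __name__ == "__main__":
    plan = address_plan("192.168.102")   # example prefix only
    for ip in ("192.168.102.0", "192.168.102.1",
               "192.168.102.10", "192.168.102.254", "192.168.102.255"):
        print(ip, "->", plan[ip])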
Update with the following:
Hi Shanna, OK, I see that, so globally let's go with:
eth0 192.168.102.x/255.255.0.0
eth1 10.10.102.x/255.255.0.0 (data, needs to reach the NetApp filer at 10.10.0.y/255.255.0.0)
eth2 129.133.1.226 public (wesleyan.edu)
eth3 192.168.103.x/255.255.255.0 ipmi (or over eth0?)
eth4 192.168.104.x/255.255.255.0 ilo (or over eth0?)
ib0 10.11.103.x/255.255.255.0 ipoib (data)
ib1 10.11.104.x/255.255.255.0 ipoib (data, not used at the start)
where x=254 for the head node and x=10 (incrementing by 1) for nodes n1-n32
Does that work for you? I'm unsure how iLO/IPMI works, but it could go over eth0.
-Henk
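A quick sketch of how the “x=254 for head, x=10 and up” rule in the email expands into per-host addresses. The device-to-prefix table restates the email (eth2/eth3/eth4 are left out: the public IP is a single address and the iLO/IPMI question is still open), and the node count n1-n32 is the one mentioned there.

PREFIXES = {
    "eth0": ("192.168.102", "255.255.0.0"),    # provisioning
    "eth1": ("10.10.102",   "255.255.0.0"),    # data/private
    "ib0":  ("10.11.103",   "255.255.255.0"),  # IPoIB data
    "ib1":  ("10.11.104",   "255.255.255.0"),  # IPoIB data, not used at the start
}

def host_interfaces(x: int) -> dict[str, str]:
    return {dev: f"{prefix}.{x}/{mask}" for dev, (prefix, mask) in PREFIXES.items()}

hosts = {"head": host_interfaces(254)}
for i in range(1, 33):                         # n1-n32 get x = 10, 11, ..., 41
    hosts[f"n{i}"] = host_interfaces(10 + i - 1)

print(hosts["head"]["eth0"])   # 192.168.102.254/255.255.0.0
print(hosts["n1"]["eth1"])     # 10.10.102.10/255.255.0.0
print(hosts["n32"]["ib0"])     # 10.11.103.41/255.255.255.0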
InfiniBand
HP Link
Configuration, fine-tuning, identifying bottlenecks, monitoring, administration. Investigate Voltaire UFM?
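A small sketch of the kind of monitoring we could script ourselves while evaluating UFM, assuming the standard OFED ibstat utility is installed on the nodes (this is not HP- or Voltaire-specific tooling, just a quick check that HCA ports are Active).

import subprocess

def ib_port_states() -> list[str]:
    """Return the State/Rate lines reported by `ibstat` for all HCA ports."""
    out = subprocess.run(["ibstat"], capture_output=True, text=True, check=True)
    return [line.strip() for line in out.stdout.splitlines()
            if line.strip().startswith(("State:", "Rate:"))]

if __name__ == "__main__":
    for line in ib_port_states():
        print(line)    # e.g. "State: Active" / "Rate: 40"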
DL380 G7
HP Link (head node)
External link: video about the hardware
hostname
greentail, another local “tail” name; also a reference to HP being 18-24% more efficient in power/cooling
eth0, provision, 192.168.102.254/255.255.0.0 (greentail-eth0, should go to better switch ProCurve 2910)
eth1, data/private, 10.10.102.254/255.255.0.0 (greentail-eth1, should go to ProCurve 2610)
eth2, public, 129.133.1.226/255.255.255.0 (greentail.wesleyan.edu, we provide the cable connection)
eth3 (over eth0), ipmi, 192.168.103.254/255.255.0.0 (greentail-ipmi, should go to better switch ProCurve 2910, do later)
ib0, ipoib, 10.10.103.254/255.255.0.0 (greentail-ib0)
ib1, ipoib, 10.10.104.254/255.255.0.0 (greentail-ib1, configure; might not have cables! split traffic across ports?)
RAID 1 mirrored disks (2×250 GB)
/home mount point for the home directory volume, ~10 TB (contains /home/apps/src)
/snapshot mount point for the snapshot volume, ~10 TB
/sanscratch mount point for the sanscratch volume, ~5 TB
/localscratch … maybe just a directory
logical volumes ROOT/VAR/BOOT/TMP: defaults
IPoIB configuration
SIM configuration
CMU configuration
SGE configuration
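A sketch (our own tooling assumption, not anything HP ships) that emits /etc/hosts-style entries following the greentail-ethN / nN-ethN naming above. Subnets mirror the head-node and compute-node lists in these notes, and the count of 32 nodes is taken from the email; adjust if the ib0/ib1 ranges or node count change.

PREFIXES = {
    "eth0": "192.168.102",  # provision
    "eth1": "10.10.102",    # data/private
    "ib0":  "10.10.103",    # IPoIB
    "ib1":  "10.10.104",    # IPoIB, later
}

def hosts_lines() -> list[str]:
    lines = [f"{prefix}.254\tgreentail-{dev}" for dev, prefix in PREFIXES.items()]
    for i in range(32):                          # n0-n31 at .10 and up
        for dev, prefix in PREFIXES.items():
            lines.append(f"{prefix}.{10 + i}\tn{i}-{dev}")
    return lines

if __name__ == "__main__":
    print("\n".join(hosts_lines()[:8]))          # preview the first few entries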
StorageWorks MSA60
SL2x170z G6
HP Link (compute nodes)
node names: n0, incrementing by 1
eth0, provision, 192.168.102.10(increment by 1)/255.255.0.0 (n0-eth0, should go to better switch ProCurve 2910)
eth1, data/private, 10.10.102.10(increment by 1)/255.255.0.0 (n0-eth1, should go to ProCurve 2610)
eth2 (over eth0), ipmi, 192.168.103.10(increment by 1)/255.255.0.0 (n0-ipmi, should go to better switch ProCurve 2910, do later)
ib0, ipoib, 10.10.103.10(increment by 1)/255.255.0.0 (n0-ib0)
ib1, ipoib, 10.10.104.10(increment by 1)/255.255.0.0 (n0-ib1, configure; might not have cables!)
/home mount point for the home directory volume, ~10 TB (contains /home/apps/src)
/snapshot mount point for the snapshot volume, ~10 TB
/sanscratch mount point for the sanscratch volume, ~5 TB
(the following volumes must be at least 50% empty for cloning to work; see the check sketched below)
logical volume LOCALSCRATCH: mount at /localscratch, ~100 GB (60 GB left for the OS)
logical volumes ROOT/VAR/BOOT/TMP: defaults
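A rough way to verify the 50%-empty constraint before cloning, assuming a plain filesystem free-space check is good enough (the exact rule the cloning tool applies is not spelled out here).

import shutil

def half_empty(mount_point: str) -> bool:
    """True if at least 50% of the filesystem at mount_point is free."""
    usage = shutil.disk_usage(mount_point)
    return usage.free >= usage.total / 2

if __name__ == "__main__":
    for mp in ("/", "/localscratch"):       # paths from the layout above
        try:
            print(mp, "ok for cloning:", half_empty(mp))
        except FileNotFoundError:
            print(mp, "does not exist on this host")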
Misc
IPoIB
configuration, fine-tuning
monitoring
Other
ToDo
All of these are “do later” items, to be tackled after the HP cluster is up.
Location
remove 2 BSS racks (to pace.edu?), racks #3 & #4
add an L6-30 circuit if needed (we have 3? check)
fill the remaining 2 BSS racks with the good 24 GB servers, turned off