    * depending on switch IP in 192.168.102.x or 10.10.102.x
    * voltaire console can be stuffed in either

  * head node will be connected to our private network via two link-aggregated ethernet cables in the 10.10.x.y range so current home directories can be mounted somewhere (these dirs will not be available on the back end nodes)
  
  * x.y.z.255 is broadcast
  * x.y.z.254 is head or log in node
  * x.y.z.0 is gateway
  * x.y.z.<10 is for all switches and console ports
  * x.y.z.10 (up to 253) is for all compute nodes
  
We are planning to ingest our Dell cluster (37 nodes) and our Blue Sky Studios cluster (130 nodes) into this setup, hence the approach.
  
Netmask is, finally, 255.255.0.0 (excluding public 129.133 subnet).
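
A minimal sketch of the addressing convention above, shown for the 10.10.102.x data range (the last-octet roles and netmask are taken from the list; everything else is illustrative):

<code python>
# Sketch of the private addressing convention; the same last-octet
# roles apply to the other 192.168.x.y / 10.10.x.y ranges.
import ipaddress

PREFIX = "10.10.102"   # netmask 255.255.0.0 site-wide

def addr(last_octet):
    return ipaddress.ip_address(f"{PREFIX}.{last_octet}")

broadcast = addr(255)                            # x.y.z.255 is broadcast
head_node = addr(254)                            # x.y.z.254 is head or log in node
gateway   = addr(0)                              # x.y.z.0 is gateway
switches  = [addr(i) for i in range(1, 10)]      # x.y.z.<10: switches, console ports
compute   = [addr(i) for i in range(10, 254)]    # x.y.z.10 up to 253: compute nodes

print(head_node, "head node;", len(compute), "compute node addresses")
</code>

That leaves 244 compute-node addresses per range, comfortably more than the 37 Dell plus 130 Blue Sky Studios nodes to be ingested.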

===== Infiniband =====

[[http://h20000.www2.hp.com/bizsupport/TechSupport/Home.jsp?lang=en&cc=vn&prodTypeId=12883&prodSeriesId=3758753&lang=en&cc=vn|HP Link]]

  * Voltaire 4036
  * 519571-B21
  * Voltaire InfiniBand 4X QDR 36-Port Managed Switch

Configuration, fine tuning, identify bottlenecks, monitor, administer.  Investigate [[http://www.voltaire.com/Products/Unified_Fabric_Manager|Voltaire UFM]]?
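
For basic monitoring ahead of (or instead of) UFM, per-node port state can be checked; a rough sketch, assuming the OFED ibstat utility is installed on the node and prints its usual State/Rate fields (output format may vary by version):

<code python>
# Sketch: quick check of local Infiniband HCA port health via OFED ibstat.
import subprocess

def ib_port_status():
    out = subprocess.run(["ibstat"], capture_output=True, text=True, check=True).stdout
    status = {}
    for line in out.splitlines():
        line = line.strip()
        if line.startswith("State:") or line.startswith("Rate:"):
            key, _, value = line.partition(":")
            status.setdefault(key, []).append(value.strip())
    return status

if __name__ == "__main__":
    # On the Voltaire 4036 fabric we would expect State "Active" and Rate "40" (4X QDR).
    print(ib_port_status())
</code>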
  
===== DL380 G7 =====
[[http://h10010.www1.hp.com/wwpc/us/en/sm/WF31a/15351-15351-3328412-241644-241475-4091412.html|HP Link]] (head node)\\
[[http://vimeo.com/9938744|External Link]] video about the hardware
  
  * Dual power (one to UPS, one to utility, do later)
    * do we need an iLo eth? in range 192.168.104.254?
  * eth1, data/private, 10.10.102.254/255.255.0.0 (greentail-eth1, should go to ProCurve 2610)
  * eth2, public, 129.133.1.226/255.255.255.0 (greentail.wesleyan.edu, we provide cable connection)
  * eth3 (over eth2), ipmi, 192.168.103.254/255.255.0.0 (greentail-ipmi, should go to better switch ProCurve 2910, do later)
    * see discussion iLo/IPMI under CMU
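
The interface plan above could be generated rather than typed by hand; a sketch only, assuming RHEL-style /etc/sysconfig/network-scripts files and the device names and addresses exactly as listed:

<code python>
# Sketch: emit RHEL-style ifcfg files for the greentail head node NIC plan.
# Assumption: names/addresses as listed above; adjust before use.
NICS = {
    "eth1": {"IPADDR": "10.10.102.254",   "NETMASK": "255.255.0.0"},    # data/private
    "eth2": {"IPADDR": "129.133.1.226",   "NETMASK": "255.255.255.0"},  # public
    "eth3": {"IPADDR": "192.168.103.254", "NETMASK": "255.255.0.0"},    # ipmi
}

def ifcfg(dev, params):
    lines = [f"DEVICE={dev}", "BOOTPROTO=static", "ONBOOT=yes"]
    lines += [f"{k}={v}" for k, v in params.items()]
    return "\n".join(lines) + "\n"

for dev, params in NICS.items():
    # Would be written to /etc/sysconfig/network-scripts/ifcfg-<dev> on the head node.
    print(f"--- ifcfg-{dev} ---")
    print(ifcfg(dev, params))
</code>

eth0 and the two bonded private links mentioned earlier would follow the same pattern (a bond additionally needs MASTER/SLAVE entries), omitted here.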
  
  * Raid 1 mirrored disks (2x250gb)
  * /home mount point for home directory volume ~ 10tb (contains /home/apps/src)
  * /snapshot mount point for snapshot volume ~ 10tb
  * /sanscratch mount point for sanscratch volume ~ 5 tb
  * logical volume LOCALSCRATCH: mount at /localscratch ~ 100 gb (should match nodes at 160 gb, leave rest for OS)
  * logical volumes ROOT/VAR/BOOT/TMP: defaults
  
  * Three volumes to start with:
    * home (raid 6), 10 tb
    * snapshot (raid 6), 10 tb ... see todos.
    * sanscratch (raid 1 or 0, no backup), 5 tb
  
  * SIM
    * ib1, ipoib, 10.10.104.25 (increment by 1)/255.255.0.0 (hp000-ib1, configure, might not have cables!)
  
    * /home mount point for home directory volume ~ 10tb (contains /home/apps/src)
    * /snapshot mount point for snapshot volume ~ 10tb
    * /sanscratch mount point for sanscratch volume ~ 5 tb
    * (next ones must be 50% empty for cloning to work)
    * logical volume LOCALSCRATCH: mount at /localscratch ~ 100 gb (60 gb left for OS)
    * logical volumes ROOT/VAR/BOOT/TMP: defaults
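
The 50% free requirement could be checked before cloning a node; a sketch, assuming the mount points listed above exist on the node being imaged:

<code python>
# Sketch: warn if the local filesystems to be cloned are more than 50% full.
# Assumption: the mount points below match the node layout described above.
import shutil

CLONED_MOUNTS = ["/", "/var", "/boot", "/tmp", "/localscratch"]

def check_half_empty(mounts):
    for mnt in mounts:
        usage = shutil.disk_usage(mnt)
        pct_used = 100 * usage.used / usage.total
        flag = "OK" if pct_used <= 50 else "TOO FULL for cloning"
        print(f"{mnt:15s} {pct_used:5.1f}% used  {flag}")

if __name__ == "__main__":
    check_half_empty(CLONED_MOUNTS)
</code>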
    * monitor
  
  * Systems Insight Manager (SIM)
  * [[http://h18013.www1.hp.com/products/servers/management/hpsim/index.html?jumpid=go/hpsim|HP Link]] (Linux Install and Configure Guide, and User Guide)
    * Do we need a windows box (virtual) to run the Central Management Server on?
    * SIM + Cluster Monitor (MSCS)?
    * configure automatic event handling
  
  * Cluster Management Utility (CMU, up to 4,096 nodes)
  * [[http://h20338.www2.hp.com/HPC/cache/412128-0-0-0-121.html|HP Link]] (Getting Started - Hardware Preparation, Setup and Install -- Installation Guide v4.2, Users Guides)
    * HP iLo probably removes the need for IPMI, consult [[http://en.wikipedia.org/wiki/HP_Integrated_Lights-Out|External Link]], do the blades have a management card?
      * well maybe not, IPMI ([[http://en.wikipedia.org/wiki/Ipmi|External Link]]) can be scripted to power on/off, not sure about iLo (all web based); see the sketch after this list
        * hmm, we can power up/off via CMU so perhaps IPMI is not needed, nor is this ability via SIM and a web browser
    * is head node the Management server? possibly, needs access to provision and public networks
    * we may need an iLo eth? in range ... 192.168.104.x? Consult the Hardware Preparation Guide.
    * CMU wants eth0 on NIC1 and PXEboot
    * install CMU on the management node
    * install X and the CMU GUI on the client node
    * start CMU, start the client, scan for nodes, build a golden image
    * install the monitoring client when building the golden image node via the CMU GUI
    * clone nodes, deploy the management agent on the nodes
      * PXEboot and wake-on-lan must be configured manually in the BIOS
      * pre_reconf.sh (/localscratch partition?) and reconf.sh (NIC2 definition)
    * not sure we can implement CMU HA
    * collectl/colplot seems nice
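
As referenced in the list above, power control can be scripted over IPMI; a rough sketch using the standard ipmitool CLI, assuming the node BMCs sit in the 192.168.103.x ipmi range described earlier, with placeholder credentials:

<code python>
# Sketch: power nodes on/off over IPMI with the standard ipmitool CLI.
# Assumptions: BMC addresses in the 192.168.103.x ipmi range; USER/PASSWORD
# are hypothetical placeholders for the real BMC credentials.
import subprocess

USER, PASSWORD = "admin", "changeme"

def ipmi_power(bmc_ip, action="status"):
    """action is one of: status, on, off, cycle."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", bmc_ip,
           "-U", USER, "-P", PASSWORD, "chassis", "power", action]
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

if __name__ == "__main__":
    # e.g. a few compute node BMCs at 192.168.103.10 .. 192.168.103.12
    for last_octet in range(10, 13):
        print(ipmi_power(f"192.168.103.{last_octet}"))
</code>

If iLo is kept instead, iLo firmware that speaks IPMI-over-LAN could be driven the same way, which feeds into the iLo-vs-IPMI question above.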
  
  * Sun Grid Engine (SGE)
    * where in data center (do later), based on environmental works
  
===== ToDo =====

All do later, after the HP cluster is up.

  * Backups: /snapshot volume
  * Use trickery with Linux and rsync to provide snapshots? [[http://forum.synology.com/enu/viewtopic.php?f=9&t=11471|External Link]] and another [[http://www.mikerubel.org/computers/rsync_snapshots/|External Link]]
    * Exclude very large files?
    * petaltail:/root/snapshot.sh or rotate_backups.sh as examples
    * or better [[http://www.rsnapshot.org/|http://www.rsnapshot.org/]]
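
A minimal sketch of the rsync snapshot trick from the links above, assuming /home as the source and the /snapshot volume as the target; rsync's --link-dest hard-links unchanged files against the previous snapshot, which is essentially what rsnapshot packages up:

<code python>
# Sketch: rotate rsync hard-link snapshots of /home into /snapshot.
# Assumptions: paths, the *.iso exclude, and 7-snapshot retention are illustrative.
import subprocess, shutil, datetime
from pathlib import Path

SRC, DEST, KEEP = Path("/home"), Path("/snapshot"), 7

def take_snapshot():
    today = DEST / f"home-{datetime.date.today():%Y%m%d}"
    snaps = sorted(p for p in DEST.glob("home-*") if p.is_dir() and p != today)
    cmd = ["rsync", "-a", "--delete", "--exclude=*.iso"]   # example large-file exclude
    if snaps:
        cmd.append(f"--link-dest={snaps[-1]}")             # hard-link against previous snapshot
    cmd += [f"{SRC}/", str(today)]
    subprocess.run(cmd, check=True)
    for old in snaps[:max(0, len(snaps) - KEEP + 1)]:      # keep only the newest KEEP snapshots
        shutil.rmtree(old)

if __name__ == "__main__":
    take_snapshot()
</code>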

  * Lava.  Install from source and evaluate.

  * Location
    * remove 2 BSS racks (to pace.edu?), rack #3 & 4
    * add an L6-30 if needed (have 3? check)
    * fill remaining 2 BSS racks with 24gb good servers, turn off
  
\\
**[[cluster:0|Back]]**