    * depending on switch IP in 192.168.102.x or 10.10.102.x
    * voltaire console can be stuffed in either

  * head node will be connected to our private network via two link-aggregated ethernet cables in the 10.10.x.y range so current home directories can be mounted somewhere (these dirs will not be available on the back end nodes)
  
  * x.y.z.255 is broadcast
  * x.y.z.254 is head or log in node
  * x.y.z.0 is gateway
  * x.y.z.<10 is for all switches and console ports
  * x.y.z.10 (up to 253) is for all compute nodes
  
We are planning to ingest our Dell cluster (37 nodes) and our Blue Sky Studios cluster (130 nodes) into this setup, hence the approach.
  
Netmask is, finally, 255.255.0.0 (excluding public 129.133 subnet).
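
A minimal sketch of the addressing convention above, shown for the 10.10.102.x data range (the last-octet roles and netmask are taken from the list; everything else is illustrative):

<code python>
# Sketch of the private addressing convention; the same last-octet
# roles apply to the other 192.168.x.y / 10.10.x.y ranges.
import ipaddress

PREFIX = "10.10.102"   # netmask 255.255.0.0 site-wide

def addr(last_octet):
    return ipaddress.ip_address(f"{PREFIX}.{last_octet}")

broadcast = addr(255)                            # x.y.z.255 is broadcast
head_node = addr(254)                            # x.y.z.254 is head or log in node
gateway   = addr(0)                              # x.y.z.0 is gateway
switches  = [addr(i) for i in range(1, 10)]      # x.y.z.<10: switches, console ports
compute   = [addr(i) for i in range(10, 254)]    # x.y.z.10 up to 253: compute nodes

print(head_node, "head node;", len(compute), "compute node addresses")
</code>

That leaves 244 compute-node addresses per range, comfortably more than the 37 Dell plus 130 Blue Sky Studios nodes to be ingested.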

===== Infiniband =====

[[http://h20000.www2.hp.com/bizsupport/TechSupport/Home.jsp?lang=en&cc=vn&prodTypeId=12883&prodSeriesId=3758753&lang=en&cc=vn|HP Link]]

  * Voltaire 4036
  * 519571-B21
  * Voltaire InfiniBand 4X QDR 36-Port Managed Switch

Configuration, fine tuning, identify bottlenecks, monitor, administer.  Investigate [[http://www.voltaire.com/Products/Unified_Fabric_Manager|Voltaire UFM]]?
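
For basic monitoring ahead of (or instead of) UFM, per-node port state can be checked; a rough sketch, assuming the OFED ibstat utility is installed on the node and prints its usual State/Rate fields (output format may vary by version):

<code python>
# Sketch: quick check of local Infiniband HCA port health via OFED ibstat.
import subprocess

def ib_port_status():
    out = subprocess.run(["ibstat"], capture_output=True, text=True, check=True).stdout
    status = {}
    for line in out.splitlines():
        line = line.strip()
        if line.startswith("State:") or line.startswith("Rate:"):
            key, _, value = line.partition(":")
            status.setdefault(key, []).append(value.strip())
    return status

if __name__ == "__main__":
    # On the Voltaire 4036 fabric we would expect State "Active" and Rate "40" (4X QDR).
    print(ib_port_status())
</code>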
  
===== DL380 G7 =====
[[http://h10010.www1.hp.com/wwpc/us/en/sm/WF31a/15351-15351-3328412-241644-241475-4091412.html|HP Link]] (head node)\\
[[http://vimeo.com/9938744|External Link]] video about the hardware
  
  * Dual power (one to UPS, one to utility, do later)
    * do we need an iLo eth? in range 192.168.104.254?
  * eth1, data/private, 10.10.102.254/255.255.0.0 (greentail-eth1, should go to ProCurve 2610)
  * eth2, public, 129.133.1.226/255.255.255.0 (greentail.wesleyan.edu, we provide cable connection)
  * eth3 (over eth2), ipmi, 192.168.103.254/255.255.0.0 (greentail-ipmi, should go to better switch ProCurve 2910, do later)
    * see discussion iLo/IPMI under CMU
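
The interface plan above could be generated rather than typed by hand; a sketch only, assuming RHEL-style /etc/sysconfig/network-scripts files and the device names and addresses exactly as listed:

<code python>
# Sketch: emit RHEL-style ifcfg files for the greentail head node NIC plan.
# Assumption: names/addresses as listed above; adjust before use.
NICS = {
    "eth1": {"IPADDR": "10.10.102.254",   "NETMASK": "255.255.0.0"},    # data/private
    "eth2": {"IPADDR": "129.133.1.226",   "NETMASK": "255.255.255.0"},  # public
    "eth3": {"IPADDR": "192.168.103.254", "NETMASK": "255.255.0.0"},    # ipmi
}

def ifcfg(dev, params):
    lines = [f"DEVICE={dev}", "BOOTPROTO=static", "ONBOOT=yes"]
    lines += [f"{k}={v}" for k, v in params.items()]
    return "\n".join(lines) + "\n"

for dev, params in NICS.items():
    # Would be written to /etc/sysconfig/network-scripts/ifcfg-<dev> on the head node.
    print(f"--- ifcfg-{dev} ---")
    print(ifcfg(dev, params))
</code>

eth0 and the two bonded private links mentioned earlier would follow the same pattern (a bond additionally needs MASTER/SLAVE entries), omitted here.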
  
  * Raid 1 mirrored disks (2x250gb)
  * /home mount point for home directory volume ~ 10tb (contains /home/apps/src)
  * /snapshot mount point for snapshot volume ~ 10tb
  * /sanscratch mount point for sanscratch volume ~ 5 tb
  * logical volume LOCALSCRATCH: mount at /localscratch ~ 100 gb (should match nodes at 160 gb, leave rest for OS)
  * logical volumes ROOT/VAR/BOOT/TMP: defaults
  
  * Three volumes to start with:
    * home (raid 6), 10 tb
    * snapshot (raid 6), 10 tb ... see todos.
    * sanscratch (raid 1 or 0, no backup), 5 tb
  
  * SIM
    * ib1, ipoib, 10.10.104.25 (increment by 1)/255.255.0.0 (hp000-ib1, configure, might not have cables!)
  
    * /home mount point for home directory volume ~ 10tb (contains /home/apps/src)
    * /snapshot mount point for snapshot volume ~ 10tb
    * /sanscratch mount point for sanscratch volume ~ 5 tb
    * (next ones must be 50% empty for cloning to work)
    * logical volume LOCALSCRATCH: mount at /localscratch ~ 100 gb (60 gb left for OS)
    * logical volumes ROOT/VAR/BOOT/TMP: defaults
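
The 50% free requirement could be checked before cloning a node; a sketch, assuming the mount points listed above exist on the node being imaged:

<code python>
# Sketch: warn if the local filesystems to be cloned are more than 50% full.
# Assumption: the mount points below match the node layout described above.
import shutil

CLONED_MOUNTS = ["/", "/var", "/boot", "/tmp", "/localscratch"]

def check_half_empty(mounts):
    for mnt in mounts:
        usage = shutil.disk_usage(mnt)
        pct_used = 100 * usage.used / usage.total
        flag = "OK" if pct_used <= 50 else "TOO FULL for cloning"
        print(f"{mnt:15s} {pct_used:5.1f}% used  {flag}")

if __name__ == "__main__":
    check_half_empty(CLONED_MOUNTS)
</code>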
    * monitor
  
  * Systems Insight Manager (SIM)
  * [[http://h18013.www1.hp.com/products/servers/management/hpsim/index.html?jumpid=go/hpsim|HP Link]] (Linux Install and Configure Guide, and User Guide)
    * Do we need a windows box (virtual) to run the Central Management Server on?
    * SIM + Cluster Monitor (MSCS)?
    * configure automatic event handling
  
  * Cluster Management Utility (CMU, up to 4,096 nodes)
  * [[http://h20338.www2.hp.com/HPC/cache/412128-0-0-0-121.html|HP Link]] (Getting Started - Hardware Preparation, Setup and Install -- Installation Guide v4.2, Users Guides)
    * HP iLo probably removes the need for IPMI, consult [[http://en.wikipedia.org/wiki/HP_Integrated_Lights-Out|External Link]], do the blades have a management card?
      * well maybe not, IPMI ([[http://en.wikipedia.org/wiki/Ipmi|External Link]]) can be scripted to power on/off, not sure about iLo (all web based); see the sketch after this list
        * hmm, we can power up/off via CMU so perhaps IPMI is not needed, nor is this ability via SIM and a web browser
    * is head node the Management server? possibly, needs access to provision and public networks
    * we may need an iLo eth? in range ... 192.168.104.x? Consult the Hardware Preparation Guide.
    * CMU wants eth0 on NIC1 and PXEboot
    * install CMU on the management node
    * install X and the CMU GUI on the client node
    * start CMU, start the client, scan for nodes, build a golden image
    * install the monitoring client when building the golden image node via the CMU GUI
    * clone nodes, deploy the management agent on the nodes
      * PXEboot and wake-on-lan must be configured manually in the BIOS
      * pre_reconf.sh (/localscratch partition?) and reconf.sh (NIC2 definition)
    * not sure we can implement CMU HA
    * collectl/colplot seems nice
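
As referenced in the list above, power control can be scripted over IPMI; a rough sketch using the standard ipmitool CLI, assuming the node BMCs sit in the 192.168.103.x ipmi range described earlier, with placeholder credentials:

<code python>
# Sketch: power nodes on/off over IPMI with the standard ipmitool CLI.
# Assumptions: BMC addresses in the 192.168.103.x ipmi range; USER/PASSWORD
# are hypothetical placeholders for the real BMC credentials.
import subprocess

USER, PASSWORD = "admin", "changeme"

def ipmi_power(bmc_ip, action="status"):
    """action is one of: status, on, off, cycle."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", bmc_ip,
           "-U", USER, "-P", PASSWORD, "chassis", "power", action]
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

if __name__ == "__main__":
    # e.g. a few compute node BMCs at 192.168.103.10 .. 192.168.103.12
    for last_octet in range(10, 13):
        print(ipmi_power(f"192.168.103.{last_octet}"))
</code>

If iLo is kept instead, iLo firmware that speaks IPMI-over-LAN could be driven the same way, which feeds into the iLo-vs-IPMI question above.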
  
  * Sun Grid Engine (SGE)
    * where in data center (do later), based on environmental works
  
===== ToDo =====

All do later, after the HP cluster is up.

  * Backups: /snapshot volume
  * Use trickery with Linux and rsync to provide snapshots? [[http://forum.synology.com/enu/viewtopic.php?f=9&t=11471|External Link]] and another [[http://www.mikerubel.org/computers/rsync_snapshots/|External Link]]
    * Exclude very large files?
    * petaltail:/root/snapshot.sh or rotate_backups.sh as examples
    * or better [[http://www.rsnapshot.org/|http://www.rsnapshot.org/]]
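
A minimal sketch of the rsync snapshot trick from the links above, assuming /home as the source and the /snapshot volume as the target; rsync's --link-dest hard-links unchanged files against the previous snapshot, which is essentially what rsnapshot packages up:

<code python>
# Sketch: rotate rsync hard-link snapshots of /home into /snapshot.
# Assumptions: paths, the *.iso exclude, and 7-snapshot retention are illustrative.
import subprocess, shutil, datetime
from pathlib import Path

SRC, DEST, KEEP = Path("/home"), Path("/snapshot"), 7

def take_snapshot():
    today = DEST / f"home-{datetime.date.today():%Y%m%d}"
    snaps = sorted(p for p in DEST.glob("home-*") if p.is_dir() and p != today)
    cmd = ["rsync", "-a", "--delete", "--exclude=*.iso"]   # example large-file exclude
    if snaps:
        cmd.append(f"--link-dest={snaps[-1]}")             # hard-link against previous snapshot
    cmd += [f"{SRC}/", str(today)]
    subprocess.run(cmd, check=True)
    for old in snaps[:max(0, len(snaps) - KEEP + 1)]:      # keep only the newest KEEP snapshots
        shutil.rmtree(old)

if __name__ == "__main__":
    take_snapshot()
</code>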

  * Lava.  Install from source and evaluate.

  * Location
    * remove 2 BSS racks (to pace.edu?), rack #3 & 4
    * add an L6-30 if needed (have 3? check)
    * fill remaining 2 BSS racks with 24gb good servers, turn off
  
\\
**[[cluster:0|Back]]**