**[[cluster:
==== Build Hadoop Cluster ====

[[cluster:

These are my notes on building a test Hadoop cluster on virtual machines in VMware. They consist of a blend of instructions posted by others, with my commentary added.
  * CTOvision [[http://
  * Yahoo [[http://
  * Apache [[http://
  * Noll [[http://www.michael-noll.com/
  * IBM article [[http://

And

  * White [[http://
==== Building ====

  * Deployed 8 virtual machines, Oracle Linux 6, 64 bit, bare bones.
  * Each node has 1 GB RAM and a 36 GB hard disk.

  * First get rid of OpenJDK if it's in your VMware template
  * Consult CTOvision on how to do that.
  * Then download the latest Java packages from Oracle and install (a fuller hedged sketch follows the command block below).
  * Everything below is done by root.
| + | |||
| + | < | ||
| + | # all nodes, i used pdsh to spawn commands across all nodes | ||
| + | rpm -ivh / | ||
| + | rpm -ivh / | ||
| + | alternatives --install / | ||
| + | alternatives --auto java | ||
| + | # fix this as some Hadoop scripts look at this location | ||
| + | cd /usr/java | ||
| + | ln -s ./ | ||
| + | which java | ||
| + | java -version | ||
| + | </ | ||
| + | |||
| + | * Next set up the Cloudera repository | ||
| + | |||
| + | < | ||
| + | # all nodes | ||
| + | cd / | ||
| + | wget http:// | ||
| + | yum update | ||
| + | yum install hadoop-0.20 | ||
| + | </ | ||
| + | |||
| + | * Selinux, again ... | ||
| + | |||
| + | < | ||
| + | setenforce 0 | ||
| + | # edit this file and disable | ||
| + | vi / | ||
| + | </ | ||
| + | |||
| + | * Ports, the node need to talk to each other as well as allow admin pages to load | ||
| + | |||
| + | < | ||
| + | # edit this file and restart iptables | ||
| + | vi / | ||
| + | # hadoop | ||
| + | -A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50070 -j ACCEPT | ||
| + | -A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50075 -j ACCEPT | ||
| + | -A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50090 -j ACCEPT | ||
| + | -A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50105 -j ACCEPT | ||
| + | -A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50030 -j ACCEPT | ||
| + | -A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50060 -j ACCEPT | ||
| + | -A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 8020 -j ACCEPT | ||
| + | -A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50010 -j ACCEPT | ||
| + | -A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50020 -j ACCEPT | ||
| + | -A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50100 -j ACCEPT | ||
| + | -A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 8021 -j ACCEPT | ||
| + | -A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 9001 -j ACCEPT | ||
| + | -A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 8012 -j ACCEPT | ||
| + | -A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 54310 -j ACCEPT | ||
| + | -A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 54311 -j ACCEPT | ||
| + | # plus 127.0.0.1:0 and maybe 9000 | ||
| + | # hadoop admin status | ||
| + | -A INPUT -m state --state NEW -m tcp -p tcp -s 129.133.0.0/ | ||
| + | -A INPUT -m state --state NEW -m tcp -p tcp -s 129.133.0.0/ | ||
| + | -A INPUT -m state --state NEW -m tcp -p tcp -s 129.133.0.0/ | ||
| + | </ | ||
| + | |||
| + | * Install the **headnode** node and tracker | ||
| + | |||
| + | < | ||
| + | # head node | ||
| + | yum -y install hadoop-0.20-namenode | ||
| + | yum -y install hadoop-0.20-jobtracker | ||
| + | </ | ||
| + | |||
| + | * On all the work nodes | ||
| + | |||
| + | < | ||
| + | # data node | ||
| + | yum -y install hadoop-0.20-datanode | ||
| + | yum -y install hadoop-0.20-tasktracker | ||
| + | </ | ||
| + | |||
| + | * Next set up the configuration environment | ||
| + | * Edit the conf files, consult Dakini site for content | ||
| + | * Copy those 3 files to all work nodes | ||
| + | * The display command should point to the MyCluster files | ||
| + | |||
| + | < | ||
| + | # all nodes | ||
| + | cp -r / | ||
| + | alternatives --install / | ||
| + | alternatives --set hadoop-0.20-conf / | ||
| + | alternatives --display hadoop-0.20-conf | ||
| + | vi / | ||
| + | vi / | ||
| + | vi / | ||
| + | </ | ||
| + | |||
| + | * Since this is a test cluster I located the DHFS filesystem on the OS disk | ||
| + | * In a production environment you'd want multiple dedicated disks per node | ||
| + | |||
| + | < | ||
| + | # all nodes | ||
| + | mkdir -p / | ||
| + | mkdir -p / | ||
| + | mkdir -p / | ||
| + | mkdir -p / | ||
| + | chown -R hdfs:hadoop /mnt/hdfs | ||
| + | chown -R mapred: | ||
| + | </ | ||
| + | |||
| + | * Format HDFS! Very important. Do ONLY ONCE on head node. | ||
| + | |||
| + | < | ||
| + | # headnode only | ||
| + | sudo -u hdfs hadoop namenode -format | ||
| + | </ | ||
| + | |||
| + | * Fix permissions | ||
| + | |||
| + | < | ||
| + | # all nodes | ||
| + | chgrp hdfs / | ||
| + | chmod g+rw / | ||
| + | </ | ||
| + | |||
| + | * Start Hadoop nodes and trackers | ||
| + | * If you receive the dreaded " | ||
| + | * Check the log in question it'll give a hint | ||
| + | * You may have typos in the XML files and configuration does not load | ||
| + | * File permissions may prevent nodes and trackers from starting | ||
| + | * You missed a step, like in the alternatives commands | ||
| + | * You issued the HDFS format command multiples times | ||
| + | |||
| + | < | ||
| + | # head node | ||
| + | / | ||
| + | / | ||
| + | |||
| + | # work nodes | ||
| + | / | ||
| + | / | ||
| + | </ | ||
| + | |||
| + | * Alright, lets some filesystem entries | ||
| + | |||
| + | < | ||
| + | # head node only | ||
| + | sudo -u hdfs hadoop fs -mkdir / | ||
| + | sudo -u hdfs hadoop fs -chown mapred: | ||
| + | sudo -u hdfs hadoop dfs -mkdir /tmp | ||
| + | sudo -u hdfs hadoop dfs -chmod -R 1777 /tmp | ||
| + | </ | ||
| + | |||
| + | * Command line health check | ||
| + | |||
| + | < | ||
| + | sudo -u hdfs hadoop dfsadmin -report | ||
| + | sudo -u hdfs hadoop dfs -df | ||
| + | </ | ||
| + | |||
| + | * And from a remote machine access your head node | ||
| + | * Hadoop Map/Reduce Administration | ||
| + | * [[http:// | ||
| + | * The Namenode | ||
| + | * [[http:// | ||
| + | |||
| + | TODO | ||
| + | * Run some jobs | ||
| + | * Find a MOOC course | ||
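
For the first TODO item, a hedged smoke test: the pi estimator from the examples jar, assuming the CDH packages put it under /usr/lib/hadoop-0.20 (path and jar name are assumptions):

<code>
# head node -- 4 maps, 1000 samples each
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar pi 4 1000
</code>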
| \\ | \\ | ||
| **[[cluster: | **[[cluster: | ||