\\
**[[cluster:0|Back]]**
==== Build Hadoop (test) Cluster ====
[[cluster:115|Use Hadoop (test) Cluster]]
These are my notes on building a test Hadoop cluster on virtual machines in VMware. They consist of a blend of instructions posted by others with my commentary added. Please review these sites so this page makes sense to you.
* CTOvision [[http://ctovision.com/2012/01/cloudera-hadoop-quickstart/]]
* Dakini [[http://dak1n1.com/blog/9-hadoop-el6-install]]
Other sites you want to read:
* Yahoo [[http://developer.yahoo.com/hadoop/tutorial/]]
* Apache [[http://hadoop.apache.org/docs/r1.0.4/index.html]]
* Noll [[http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/]]
* IBM article [[http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/]]
And
* White [[http://hadoopbook.com/]]
==== Building ====
* Deployed 8 virtual machines, Oracle Linux 6, 64 bit, bare bones.
* Each node has 1 GB of RAM and a 36 GB hard disk.
* First get rid of OpenJDK if it's in your VMware template.
* Consult CTOvision on how to do that; a hedged removal sketch follows this list.
* Then download the latest Java packages from Oracle and install them.
* Everything below is done by root.
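* A minimal sketch of clearing out OpenJDK, assuming the stock EL6 package names; run the query first, your template may differ:
# all nodes -- list installed OpenJDK packages, then remove whatever shows up
rpm -qa | grep -i openjdk
yum -y remove java-1.6.0-openjdk java-1.7.0-openjdk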
# all nodes; I used pdsh to spawn these commands across all nodes
rpm -ivh /usr/local/src/jdk-7u21-linux-x64.rpm
rpm -ivh /usr/local/src/jre-7u21-linux-x64.rpm
alternatives --install /usr/bin/java java /usr/java/latest/bin/java 1600
alternatives --auto java
# some Hadoop scripts expect java under /usr/java/bin, so create that symlink
cd /usr/java
ln -s ./latest/bin
which java
java -version
* Next set up the Cloudera repository
# all nodes
cd /etc/yum.repos.d/
wget http://archive.cloudera.com/redhat/6/x86_64/cdh/cloudera-cdh3.repo
yum update
yum install hadoop-0.20
* SELinux, again ...
setenforce 0
# edit this file and disable
vi /etc/selinux/config
* Ports: the nodes need to talk to each other, and the admin pages need to be reachable
# edit this file and restart iptables
vi /etc/sysconfig/iptables
# hadoop
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50070 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50075 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50090 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50105 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50030 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50060 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 8020 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50010 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50020 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50100 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 8021 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 9001 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 8012 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 54310 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 54311 -j ACCEPT
# plus 127.0.0.1:0 and maybe 9000
# hadoop admin status
-A INPUT -m state --state NEW -m tcp -p tcp -s 129.133.0.0/16 --dport 50030 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -s 129.133.0.0/16 --dport 50070 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -s 129.133.0.0/16 --dport 50075 -j ACCEPT
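* After editing, reload the firewall rules (standard EL6 service command)
# all nodes
service iptables restart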
* Install the **head node** name node and job tracker
# head node
yum -y install hadoop-0.20-namenode
yum -y install hadoop-0.20-jobtracker
* On all the worker nodes install the data node and task tracker
# data node
yum -y install hadoop-0.20-datanode
yum -y install hadoop-0.20-tasktracker
* Next set up the configuration environment
* Edit the conf files; consult the Dakini site for content (a hedged sample follows the commands below)
* Copy those three files to all worker nodes
* The alternatives --display command should show the conf pointing to the MyCluster files
# all nodes
cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.MyCluster
alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.MyCluster 50
alternatives --set hadoop-0.20-conf /etc/hadoop-0.20/conf.MyCluster
alternatives --display hadoop-0.20-conf
vi /etc/hadoop-0.20/conf.MyCluster/core-site.xml
vi /etc/hadoop-0.20/conf.MyCluster/hdfs-site.xml
vi /etc/hadoop-0.20/conf.MyCluster/mapred-site.xml
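* As a rough guide only, minimal contents for the three files might look like this. The head node name and the 8020/8021 ports are assumptions (CDH defaults), so adjust them to match what you opened in iptables, and consult the Dakini site for the full set of properties.
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://headnode.wesleyan.edu:8020</value>
  </property>
</configuration>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/mnt/hdfs/1/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/hdfs/1/datanode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>headnode.wesleyan.edu:8021</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/mnt/hdfs/1/mapred</value>
  </property>
</configuration>
* To push the edited files to the worker nodes, pdcp from the pdsh suite works if installed; the host range here is a placeholder.
# head node, copy the three site files out (adjust the host list)
pdcp -w worker[1-7] /etc/hadoop-0.20/conf.MyCluster/*-site.xml /etc/hadoop-0.20/conf.MyCluster/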
* Since this is a test cluster I located the HDFS filesystem on the OS disk
* In a production environment you'd want multiple dedicated disks per node
# all nodes
mkdir -p /mnt/hdfs/1
mkdir -p /mnt/hdfs/1/namenode
mkdir -p /mnt/hdfs/1/datanode
mkdir -p /mnt/hdfs/1/mapred
chown -R hdfs:hadoop /mnt/hdfs
chown -R mapred:hadoop /mnt/hdfs/1/mapred
* Format HDFS! Very important: do this ONLY ONCE, on the head node.
# headnode only
sudo -u hdfs hadoop namenode -format
* Fix permissions
# all nodes
chgrp hdfs /usr/lib/hadoop-0.20/
chmod g+rw /usr/lib/hadoop-0.20/
* Start Hadoop nodes and trackers
* If you receive the dreaded "datanode dead but pid exists" error
* Check the log in question; it will give a hint (see the log check sketch after the start commands below)
* Typos in the XML files may keep the configuration from loading
* File permissions may prevent nodes and trackers from starting
* You missed a step, like in the alternatives commands
* You issued the HDFS format command multiple times
# head node
/etc/init.d/hadoop-0.20-namenode start
/etc/init.d/hadoop-0.20-jobtracker start
# work nodes
/etc/init.d/hadoop-0.20-datanode start
/etc/init.d/hadoop-0.20-tasktracker start
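* If a daemon dies, its log usually names the cause. A sketch, assuming the CDH3 packages write logs under /var/log/hadoop-0.20:
# work node, last lines of the datanode log
tail -n 50 /var/log/hadoop-0.20/*datanode*.log
# head node, namenode and jobtracker logs
tail -n 50 /var/log/hadoop-0.20/*namenode*.log
tail -n 50 /var/log/hadoop-0.20/*jobtracker*.log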
* Alright, let's create some filesystem entries
# head node only
sudo -u hdfs hadoop fs -mkdir /mapred/system
sudo -u hdfs hadoop fs -chown mapred:hadoop /mapred/system
sudo -u hdfs hadoop dfs -mkdir /tmp
sudo -u hdfs hadoop dfs -chmod -R 1777 /tmp
* Command line health check
sudo -u hdfs hadoop dfsadmin -report
sudo -u hdfs hadoop dfs -df
* And from a remote machine, access your head node's web interfaces
* Hadoop Map/Reduce Administration
* [[http://headnode.wesleyan.edu:50030]]
* The Namenode
* [[http://headnode.wesleyan.edu:50070]]
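* To confirm the admin ports answer before pointing a browser at them, a plain curl from a machine in the allowed source range is enough:
# each should return the HTML of the status page
curl -s http://headnode.wesleyan.edu:50070/ | head -n 5
curl -s http://headnode.wesleyan.edu:50030/ | head -n 5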
TODO
* Run some jobs (a starter example follows)
* Find a MOOC course
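* A starter job, run as the hdfs user to sidestep creating per-user HDFS directories; the examples jar path is an assumption based on the CDH3 hadoop-0.20 package layout:
# head node, estimate pi with 4 map tasks and 1000 samples each
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar pi 4 1000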
\\
**[[cluster:0|Back]]**