**[[cluster:0|Back]]**
  
==== Build Hadoop (test) Cluster ====
  
[[cluster:115|Use Hadoop (test) Cluster]]

These are my notes on building a test Hadoop cluster on virtual machines in VMware. They consist of a blend of instructions posted by others, with my commentary added. Please review these sites so this page makes sense to you.
  
  * CTOvision [[http://ctovision.com/2012/01/cloudera-hadoop-quickstart/]]
  * Yahoo [[http://developer.yahoo.com/hadoop/tutorial/]]
  * Apache [[http://hadoop.apache.org/docs/r1.0.4/index.html]]
  * Noll [[http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/]]
  * IBM article [[http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/]]

And

  * White [[http://hadoopbook.com/]]

==== Building ====

  * Deployed 8 virtual machines, Oracle Linux 6, 64-bit, bare bones.
  * Each node has 1 GB RAM and a 36 GB hard disk.

  * First get rid of OpenJDK if it's in your VMware template
    * Consult CTOvision on how to do that; a minimal sketch follows below.
  * Then download the latest Java packages from Oracle and install them.
  * Everything below is done by root.
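
A minimal sketch of that cleanup, assuming the template shipped with the stock OpenJDK packages (check with rpm first, the package names vary by template):

<code>
# all nodes -- list and remove any OpenJDK packages the template came with
rpm -qa | grep -i openjdk
yum -y remove java-1.6.0-openjdk java-1.6.0-openjdk-devel
</code>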

<code>
# all nodes, i used pdsh to spawn commands across all nodes
rpm -ivh /usr/local/src/jdk-7u21-linux-x64.rpm
rpm -ivh /usr/local/src/jre-7u21-linux-x64.rpm
alternatives --install /usr/bin/java java /usr/java/latest/bin/java 1600
alternatives --auto java
# fix this as some Hadoop scripts look at this location
cd /usr/java
ln -s ./latest/bin
which java
java -version
</code>

  * Next set up the Cloudera repository

<code>
# all nodes
cd /etc/yum.repos.d/
wget http://archive.cloudera.com/redhat/6/x86_64/cdh/cloudera-cdh3.repo
yum update
yum install hadoop-0.20
</code>
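
A quick optional check that the repository took and the rest of the hadoop-0.20 packages are visible:

<code>
# all nodes -- the cloudera repo should be listed and offer the daemon packages
yum repolist | grep -i cloudera
yum list available 'hadoop-0.20*'
</code>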

  * Selinux, again ...

<code>
setenforce 0
# edit this file and disable
vi /etc/selinux/config
</code>
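
If you prefer a non-interactive edit over vi, something like the following should do the same thing (a sketch; check the file afterwards):

<code>
# all nodes -- setenforce 0 takes effect immediately, the config edit applies at the next boot
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
getenforce
</code>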

  * Ports: the nodes need to talk to each other, and the admin pages need to be able to load

<code>
# edit this file and restart iptables
vi /etc/sysconfig/iptables
# hadoop
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50070 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50075 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50090 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50105 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50030 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50060 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 8020 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50010 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50020 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50100 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 8021 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 9001 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 8012 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 54310 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 54311 -j ACCEPT
# plus 127.0.0.1:0 and maybe 9000
# hadoop admin status
-A INPUT -m state --state NEW -m tcp -p tcp -s 129.133.0.0/16 --dport 50030 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -s 129.133.0.0/16 --dport 50070 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -s 129.133.0.0/16 --dport 50075 -j ACCEPT
</code>
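
After editing, restart the firewall and confirm the new rules are in place:

<code>
# all nodes
service iptables restart
iptables -L INPUT -n | grep -E '50070|50030|54310'
</code>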

  * Install the name node and job tracker packages on the **headnode**

<code>
# head node
yum -y install hadoop-0.20-namenode
yum -y install hadoop-0.20-jobtracker
</code>

  * On all the work nodes, install the data node and task tracker packages

<code>
# data node
yum -y install hadoop-0.20-datanode
yum -y install hadoop-0.20-tasktracker
</code>
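
A quick way to confirm each node ended up with the right pieces:

<code>
# head node should list namenode and jobtracker; work nodes datanode and tasktracker
rpm -qa | grep hadoop-0.20
</code>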

  * Next set up the configuration environment
  * Edit the conf files; consult the Dakini site for content (an illustrative sketch of the key properties follows after the block below)
  * Copy those 3 files to all work nodes
  * The display command should point to the MyCluster files

<code>
# all nodes
cp -r /etc/hadoop-0.20/conf.empty   /etc/hadoop-0.20/conf.MyCluster
alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.MyCluster 50
alternatives --set hadoop-0.20-conf /etc/hadoop-0.20/conf.MyCluster
alternatives --display hadoop-0.20-conf
vi /etc/hadoop-0.20/conf.MyCluster/core-site.xml
vi /etc/hadoop-0.20/conf.MyCluster/hdfs-site.xml
vi /etc/hadoop-0.20/conf.MyCluster/mapred-site.xml
</code>
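
The Dakini page is the authoritative source for the content of those three files. Purely as an illustration, the key properties might look like the snippet below; each file wraps its properties in a configuration element, the hostname and the 54310/54311 ports are assumptions that must match your setup and the iptables rules above, and the local paths match the directories created further below:

<code>
<!-- core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://headnode.wesleyan.edu:54310</value>
</property>

<!-- hdfs-site.xml -->
<property>
  <name>dfs.name.dir</name>
  <value>/mnt/hdfs/1/namenode</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/hdfs/1/datanode</value>
</property>

<!-- mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>headnode.wesleyan.edu:54311</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/mnt/hdfs/1/mapred</value>
</property>
</code>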

  * Since this is a test cluster I located the HDFS filesystem on the OS disk
      * In a production environment you'd want multiple dedicated disks per node

<code>
# all nodes
mkdir -p /mnt/hdfs/
mkdir -p /mnt/hdfs/1/namenode
mkdir -p /mnt/hdfs/1/datanode
mkdir -p /mnt/hdfs/1/mapred
chown -R hdfs:hadoop /mnt/hdfs
chown -R mapred:hadoop /mnt/hdfs/1/mapred
</code>
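
Double-check the ownership before formatting; the namenode and datanode directories should belong to hdfs:hadoop and the mapred directory to mapred:hadoop:

<code>
# all nodes
ls -ld /mnt/hdfs/1/*
</code>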

  * Format HDFS! Very important: do this ONLY ONCE, on the head node.

<code>
# headnode only
sudo -u hdfs hadoop namenode -format
</code>

  * Fix permissions

<code>
# all nodes
chgrp hdfs /usr/lib/hadoop-0.20/
chmod g+rw /usr/lib/hadoop-0.20/
</code>

  * Start the Hadoop nodes and trackers (a status check is sketched after the block below)
  * If you receive the dreaded "datanode dead but pid exists" error, check the log in question; it will give a hint:
    * You may have typos in the XML files so the configuration does not load
    * File permissions may prevent nodes and trackers from starting
    * You missed a step, like in the alternatives commands
    * You issued the HDFS format command multiple times

<code>
# head node
/etc/init.d/hadoop-0.20-namenode start
/etc/init.d/hadoop-0.20-jobtracker start

# work nodes
/etc/init.d/hadoop-0.20-datanode start
/etc/init.d/hadoop-0.20-tasktracker start
</code>
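
To confirm everything is actually up, the init scripts also accept a status argument, and jps (part of the Oracle JDK) should list the running Java daemons:

<code>
# head node -- expect NameNode and JobTracker (DataNode and TaskTracker on the work nodes)
/etc/init.d/hadoop-0.20-namenode status
/etc/init.d/hadoop-0.20-jobtracker status
jps
</code>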

  * Alright, let's create some filesystem entries inside HDFS

<code>
# head node only
sudo -u hdfs hadoop fs -mkdir /mapred/system
sudo -u hdfs hadoop fs -chown mapred:hadoop /mapred/system
sudo -u hdfs hadoop dfs -mkdir /tmp
sudo -u hdfs hadoop dfs -chmod -R 1777 /tmp
</code>
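
A quick listing shows whether those directories ended up where expected:

<code>
# head node only
sudo -u hdfs hadoop fs -ls /
sudo -u hdfs hadoop fs -ls /mapred
</code>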

  * Command line health check

<code>
sudo -u hdfs hadoop dfsadmin -report
sudo -u hdfs hadoop dfs -df
</code>
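
Before trying the admin pages from another machine, you can check locally on the head node that the web interfaces answer at all (a 200 or a redirect code is fine):

<code>
# head node
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:50030
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:50070
</code>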

  * And from a remote machine, access your head node
    * Hadoop Map/Reduce Administration
      * [[http://headnode.wesleyan.edu:50030]]
    * The Namenode
      * [[http://headnode.wesleyan.edu:50070]]

TODO
  
  * Run some jobs (a first smoke test is sketched below)
  * Find a MOOC course
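
Purely as a sketch of that first item: CDH3 places an examples jar under /usr/lib/hadoop-0.20/ (the exact file name varies by release, so check the directory), and the pi example makes a reasonable smoke test when submitted as a non-root user:

<code>
# head node; adjust the jar name to whatever hadoop-examples jar your release installed
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar pi 2 1000
</code>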
  
 \\ \\
**[[cluster:0|Back]]**