**[[cluster:0|Back]]**
  
==== Build Hadoop (test) Cluster ====
  
[[cluster:115|Use Hadoop (test) Cluster]]

These are my notes on building a test Hadoop cluster on virtual machines in VMware. They consist of a blend of instructions posted by others with my commentary added. Please review these sites so this page makes sense to you.
  
  * CTOvision [[http://ctovision.com/2012/01/cloudera-hadoop-quickstart/]]
  * Yahoo [[http://developer.yahoo.com/hadoop/tutorial/]]
  * Apache [[http://hadoop.apache.org/docs/r1.0.4/index.html]]
  * Noll [[http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/]]
  * IBM article [[http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/]]

And

  * White [[http://hadoopbook.com/]]

==== Building ====

  * Deployed 8 virtual machines, Oracle Linux 6, 64-bit, bare bones.
  * Each node has 1 GB RAM and a 36 GB hard disk.

  * First get rid of OpenJDK if it's in your VMware template.
    * Consult CTOvision on how to do that.
  * Then download the latest Java packages from Oracle and install them.
  * Everything below is done as root.

<code>
# all nodes; I used pdsh to spawn commands across all nodes
rpm -ivh /usr/local/src/jdk-7u21-linux-x64.rpm
rpm -ivh /usr/local/src/jre-7u21-linux-x64.rpm
alternatives --install /usr/bin/java java /usr/java/latest/bin/java 1600
alternatives --auto java
# fix this, as some Hadoop scripts expect java under /usr/java/bin
cd /usr/java
ln -s ./latest/bin
which java
java -version
</code>
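
  * The same commands can be fanned out with pdsh instead of typing them on every node; a minimal sketch, assuming the work nodes are named node1 through node8 (substitute your own hostnames)

<code>
# node1-node8 are placeholder hostnames for this example
pdsh -w node[1-8] 'rpm -ivh /usr/local/src/jdk-7u21-linux-x64.rpm /usr/local/src/jre-7u21-linux-x64.rpm'
pdsh -w node[1-8] 'alternatives --install /usr/bin/java java /usr/java/latest/bin/java 1600; alternatives --auto java'
</code>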

  * Next set up the Cloudera repository.

<code>
# all nodes
cd /etc/yum.repos.d/
wget http://archive.cloudera.com/redhat/6/x86_64/cdh/cloudera-cdh3.repo
yum update
yum install hadoop-0.20
</code>

  * SELinux, again ...

<code>
setenforce 0
# edit this file and disable SELinux permanently
vi /etc/selinux/config
</code>
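
  * "Disable" in that file comes down to one setting; this is what /etc/selinux/config should carry so SELinux stays off after a reboot

<code>
# /etc/selinux/config
SELINUX=disabled
</code>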

  * Ports: the nodes need to talk to each other, and the admin pages need to be reachable.

<code>
# edit this file and restart iptables
vi /etc/sysconfig/iptables
# hadoop
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50070 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50075 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50090 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50105 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50030 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50060 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 8020 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50010 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50020 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50100 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 8021 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 9001 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 8012 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 54310 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 54311 -j ACCEPT
# plus 127.0.0.1:0 and maybe 9000
# hadoop admin status
-A INPUT -m state --state NEW -m tcp -p tcp -s 129.133.0.0/16 --dport 50030 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -s 129.133.0.0/16 --dport 50070 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -s 129.133.0.0/16 --dport 50075 -j ACCEPT
</code>
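
  * After editing the rules, restart iptables so they take effect

<code>
# all nodes
/etc/init.d/iptables restart
</code>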

  * Install the name node and job tracker on the **head node**

<code>
# head node
yum -y install hadoop-0.20-namenode
yum -y install hadoop-0.20-jobtracker
</code>

  * On all the work nodes, install the data node and task tracker

<code>
# work nodes
yum -y install hadoop-0.20-datanode
yum -y install hadoop-0.20-tasktracker
</code>
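
  * Optionally make the daemons come back after a reboot; a sketch using chkconfig with the init scripts installed above

<code>
# head node
chkconfig hadoop-0.20-namenode on
chkconfig hadoop-0.20-jobtracker on
# work nodes
chkconfig hadoop-0.20-datanode on
chkconfig hadoop-0.20-tasktracker on
</code>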

  * Next set up the configuration environment
  * Edit the conf files, consult the Dakini site for content (a hedged sketch of typical contents follows the code block below)
  * Copy those 3 files to all work nodes
  * The alternatives display command should show the MyCluster files as the active configuration

<code>
# all nodes
cp -r /etc/hadoop-0.20/conf.empty   /etc/hadoop-0.20/conf.MyCluster
alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.MyCluster 50
alternatives --set hadoop-0.20-conf /etc/hadoop-0.20/conf.MyCluster
alternatives --display hadoop-0.20-conf
vi /etc/hadoop-0.20/conf.MyCluster/core-site.xml
vi /etc/hadoop-0.20/conf.MyCluster/hdfs-site.xml
vi /etc/hadoop-0.20/conf.MyCluster/mapred-site.xml
</code>
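
  * For orientation only, a hedged sketch of the kind of properties these three files end up holding on a setup like this one; the hostname, ports and paths below are taken from elsewhere on this page, and the Dakini site remains the authoritative reference

<code>
<!-- core-site.xml: where HDFS lives -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://headnode.wesleyan.edu:54310</value>
</property>

<!-- hdfs-site.xml: where namenode and datanode keep their data -->
<property>
  <name>dfs.name.dir</name>
  <value>/mnt/hdfs/1/namenode</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/hdfs/1/datanode</value>
</property>

<!-- mapred-site.xml: where the jobtracker lives, plus local scratch space -->
<property>
  <name>mapred.job.tracker</name>
  <value>headnode.wesleyan.edu:54311</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/mnt/hdfs/1/mapred</value>
</property>
</code>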

  * Since this is a test cluster I located the HDFS filesystem on the OS disk
      * In a production environment you'd want multiple dedicated disks per node

<code>
# all nodes
mkdir -p /mnt/hdfs/
mkdir -p /mnt/hdfs/1/namenode
mkdir -p /mnt/hdfs/1/datanode
mkdir -p /mnt/hdfs/1/mapred
chown -R hdfs:hadoop /mnt/hdfs
chown -R mapred:hadoop /mnt/hdfs/1/mapred
</code>

  * Format HDFS! Very important: do this ONLY ONCE, on the head node.

<code>
# head node only
sudo -u hdfs hadoop namenode -format
</code>

  * Fix permissions

<code>
# all nodes
chgrp hdfs /usr/lib/hadoop-0.20/
chmod g+rw /usr/lib/hadoop-0.20/
</code>

  * Start the Hadoop nodes and trackers
  * If you receive the dreaded "datanode dead but pid exists" error, check the log in question; it'll give a hint:
    * You may have typos in the XML files so the configuration does not load
    * File permissions may prevent nodes and trackers from starting
    * You missed a step, like in the alternatives commands
    * You issued the HDFS format command multiple times

<code>
# head node
/etc/init.d/hadoop-0.20-namenode start
/etc/init.d/hadoop-0.20-jobtracker start

# work nodes
/etc/init.d/hadoop-0.20-datanode start
/etc/init.d/hadoop-0.20-tasktracker start
</code>
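
  * A quick sanity check on any node: the JDK's jps should list the daemons you just started, and the logs (under /var/log/hadoop-0.20/ on this CDH3 install) are the place to look when one of them refuses to come up

<code>
# any node
jps
ls -lt /var/log/hadoop-0.20/
</code>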

  * Alright, let's create some filesystem entries

<code>
# head node only
sudo -u hdfs hadoop fs -mkdir /mapred/system
sudo -u hdfs hadoop fs -chown mapred:hadoop /mapred/system
sudo -u hdfs hadoop dfs -mkdir /tmp
sudo -u hdfs hadoop dfs -chmod -R 1777 /tmp
</code>

  * Command line health check

<code>
sudo -u hdfs hadoop dfsadmin -report
sudo -u hdfs hadoop dfs -df
</code>

  * And from a remote machine access your head node
    * Hadoop Map/Reduce Administration
      * [[http://headnode.wesleyan.edu:50030]]
    * The Namenode
      * [[http://headnode.wesleyan.edu:50070]]

TODO
  
  * Run some jobs
  * Find a MOOC course
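
For the first item, a smoke test that is usually enough to prove the plumbing is the bundled pi estimator; a sketch, submitted from the head node (the examples jar name varies with the CDH3 update level, hence the glob).

<code>
# head node, run as the hdfs user
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20/hadoop-examples-*.jar pi 4 1000
</code>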
  
\\ \\
**[[cluster:0|Back]]**