\\ **[[cluster:0|Back]]**

==== Build Hadoop (test) Cluster ====

[[cluster:115|Use Hadoop (test) Cluster]]

These are my notes on building a test Hadoop cluster on virtual machines in VMware. They consist of a blend of instructions posted by others with my commentary added. Please review these sites so this page makes sense to you.

  * CTOvision [[http://ctovision.com/2012/01/cloudera-hadoop-quickstart/]]
  * Dakini [[http://dak1n1.com/blog/9-hadoop-el6-install]]

Other sites you want to read:

  * Yahoo [[http://developer.yahoo.com/hadoop/tutorial/]]
  * Apache [[http://hadoop.apache.org/docs/r1.0.4/index.html]]
  * Noll [[http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/]]
  * IBM article [[http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/]]

And

  * White [[http://hadoopbook.com/]]

==== Building ====

  * Deployed 8 virtual machines, Oracle Linux 6, 64 bit, bare bones.
  * Each node has 1 GB RAM and a 36 GB hard disk.
  * First get rid of OpenJDK if it is in your VMware template; consult CTOvision on how to do that.
  * Then download the latest Java packages from Oracle and install them.
  * Everything below is done as root.

<code>
# all nodes, I used pdsh to spawn commands across all nodes
rpm -ivh /usr/local/src/jdk-7u21-linux-x64.rpm
rpm -ivh /usr/local/src/jre-7u21-linux-x64.rpm

alternatives --install /usr/bin/java java /usr/java/latest/bin/java 1600
alternatives --auto java

# fix this as some Hadoop scripts look at this location
cd /usr/java
ln -s ./latest/bin

which java
java -version
</code>

  * Next set up the Cloudera repository

<code>
# all nodes
cd /etc/yum.repos.d/
wget http://archive.cloudera.com/redhat/6/x86_64/cdh/cloudera-cdh3.repo

yum update
yum install hadoop-0.20
</code>

  * SELinux, again ...

<code>
setenforce 0

# edit this file and disable
vi /etc/selinux/config
</code>

  * Ports: the nodes need to talk to each other, and the admin pages need to be reachable

<code>
# edit this file and restart iptables
vi /etc/sysconfig/iptables

# hadoop
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50070 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50075 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50090 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50105 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50030 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50060 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 8020 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50010 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50020 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 50100 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 8021 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 9001 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 8012 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 54310 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -m iprange --src-range 129.133.x.xxx-129.133.x.xxx --dport 54311 -j ACCEPT
# plus 127.0.0.1:0 and maybe 9000

# hadoop admin status
-A INPUT -m state --state NEW -m tcp -p tcp -s 129.133.0.0/16 --dport 50030 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -s 129.133.0.0/16 --dport 50070 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -s 129.133.0.0/16 --dport 50075 -j ACCEPT
</code>

  * On the **headnode** install the name node and job tracker

<code>
# head node
yum -y install hadoop-0.20-namenode
yum -y install hadoop-0.20-jobtracker
</code>

  * On all the worker nodes install the data node and task tracker

<code>
# data node
yum -y install hadoop-0.20-datanode
yum -y install hadoop-0.20-tasktracker
</code>

  * Next set up the configuration environment
  * Edit the conf files; consult the Dakini site for content (a minimal sketch follows after this block)
  * Copy those 3 files to all worker nodes
  * The display command should point to the MyCluster files

<code>
# all nodes
cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.MyCluster

alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.MyCluster 50
alternatives --set hadoop-0.20-conf /etc/hadoop-0.20/conf.MyCluster
alternatives --display hadoop-0.20-conf

vi /etc/hadoop-0.20/conf.MyCluster/core-site.xml
vi /etc/hadoop-0.20/conf.MyCluster/hdfs-site.xml
vi /etc/hadoop-0.20/conf.MyCluster/mapred-site.xml
</code>
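For reference, here is a minimal sketch of what the three MyCluster files might contain for a setup like this one. The hostname ''headnode.wesleyan.edu'', the ports 54310/54311, and the ''/mnt/hdfs/1'' paths are assumptions taken from the firewall rules and directory layout used elsewhere on this page; the Dakini post remains the authoritative source for the full property list.

<code>
# a sketch only, assuming headnode.wesleyan.edu and ports 54310/54311 as above;
# write on the head node, then copy conf.MyCluster to the worker nodes

cat > /etc/hadoop-0.20/conf.MyCluster/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- where clients and daemons find the HDFS namenode -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://headnode.wesleyan.edu:54310</value>
  </property>
</configuration>
EOF

cat > /etc/hadoop-0.20/conf.MyCluster/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- local paths created further down this page -->
  <property>
    <name>dfs.name.dir</name>
    <value>/mnt/hdfs/1/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/hdfs/1/datanode</value>
  </property>
</configuration>
EOF

cat > /etc/hadoop-0.20/conf.MyCluster/mapred-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- where tasktrackers find the jobtracker -->
  <property>
    <name>mapred.job.tracker</name>
    <value>headnode.wesleyan.edu:54311</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/mnt/hdfs/1/mapred</value>
  </property>
</configuration>
EOF
</code>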
  * Since this is a test cluster I located the HDFS filesystem on the OS disk
  * In a production environment you'd want multiple dedicated disks per node

<code>
# all nodes
mkdir -p /mnt/hdfs/1
mkdir -p /mnt/hdfs/1/namenode
mkdir -p /mnt/hdfs/1/datanode
mkdir -p /mnt/hdfs/1/mapred

chown -R hdfs:hadoop /mnt/hdfs
chown -R mapred:hadoop /mnt/hdfs/1/mapred
</code>

  * Format HDFS! Very important: do this ONLY ONCE, on the head node.

<code>
# headnode only
sudo -u hdfs hadoop namenode -format
</code>

  * Fix permissions

<code>
# all nodes
chgrp hdfs /usr/lib/hadoop-0.20/
chmod g+rw /usr/lib/hadoop-0.20/
</code>

  * Start the Hadoop nodes and trackers
  * If you receive the dreaded "datanode dead but pid exists" error, check the log in question; it will give a hint
    * You may have typos in the XML files, so the configuration does not load
    * File permissions may prevent nodes and trackers from starting
    * You missed a step, like in the alternatives commands
    * You issued the HDFS format command multiple times

<code>
# head node
/etc/init.d/hadoop-0.20-namenode start
/etc/init.d/hadoop-0.20-jobtracker start

# work nodes
/etc/init.d/hadoop-0.20-datanode start
/etc/init.d/hadoop-0.20-tasktracker start
</code>

  * Alright, let's create some filesystem entries

<code>
# head node only
sudo -u hdfs hadoop fs -mkdir /mapred/system
sudo -u hdfs hadoop fs -chown mapred:hadoop /mapred/system

sudo -u hdfs hadoop dfs -mkdir /tmp
sudo -u hdfs hadoop dfs -chmod -R 1777 /tmp
</code>

  * Command line health check

<code>
sudo -u hdfs hadoop dfsadmin -report
sudo -u hdfs hadoop dfs -df
</code>

  * And from a remote machine, access your head node's admin pages
    * Hadoop Map/Reduce Administration
      * [[http://headnode.wesleyan.edu:50030]]
    * The Namenode
      * [[http://headnode.wesleyan.edu:50070]]

TODO

  * Run some jobs (a first smoke test is sketched at the bottom of this page)
  * Find a MOOC course

\\ **[[cluster:0|Back]]**
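As a starting point for the "run some jobs" item above, a quick smoke test could be the example jobs that ship with the hadoop-0.20 package. The jar path ''/usr/lib/hadoop-0.20/hadoop-examples.jar'', the HDFS paths under ''/user/hdfs'', and running as the hdfs user are assumptions for this test setup; adjust them to whatever exists on your head node.

<code>
# head node -- a sketch of a first smoke test, assuming the stock examples jar
# installed by the hadoop-0.20 package lives in /usr/lib/hadoop-0.20/

# estimate pi with 4 maps and 1000 samples per map (needs no input data)
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar pi 4 1000

# a tiny wordcount: stage some input in HDFS, run the job, read the result
sudo -u hdfs hadoop fs -mkdir /user/hdfs/wordcount-in
sudo -u hdfs hadoop fs -put /etc/hadoop-0.20/conf.MyCluster/*.xml /user/hdfs/wordcount-in
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar wordcount \
    /user/hdfs/wordcount-in /user/hdfs/wordcount-out
sudo -u hdfs hadoop fs -cat /user/hdfs/wordcount-out/part-* | head
</code>

If both jobs finish, the job tracker page on port 50030 should list them as completed, which is a reasonable end-to-end check of HDFS and MapReduce together.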