**[[cluster:0|Back]]**
  
===== Use Hadoop Cluster =====
  
[[cluster:114|Build Hadoop Cluster]]
  
  * a bit involved ... [[http://bighadoop.wordpress.com/2013/02/25/r-and-hadoop-data-analysis-rhadoop/]]

Here are my steps to get this working (with lots of Ross's help) for the rmr2 and rhdfs installations. Do this on all nodes.

  * Add the EPEL repository to your yum installation, then
  * ''yum install R'', which pulls in

<code>
R-core-3.0.0-2.el6.x86_64
R-java-devel-3.0.0-2.el6.x86_64
R-devel-3.0.0-2.el6.x86_64
R-core-devel-3.0.0-2.el6.x86_64
R-java-3.0.0-2.el6.x86_64
R-3.0.0-2.el6.x86_64
</code>

Make sure java is installed properly (the same one you used for Hadoop itself) and set the environment in /etc/profile:

<code>
export JAVA_HOME="/usr/java/latest"
export PATH=/usr/java/latest/bin:$PATH

export HADOOP_HOME=/usr/lib/hadoop-0.20
export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_STREAMING=/usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u6.jar
</code>

I noticed that at some point openJDK gets reinstalled, so I managed these links manually:

<code>
lrwxrwxrwx  1 root root 24 May 27 10:41 /usr/bin/jar -> /usr/java/latest/bin/jar
lrwxrwxrwx  1 root root 21 May 27 09:47 /usr/bin/jar-alt -> /etc/alternatives/jar
lrwxrwxrwx  1 root root 30 May 27 10:41 /usr/bin/jarsigner -> /usr/java/latest/bin/jarsigner
lrwxrwxrwx  1 root root 27 May 27 09:47 /usr/bin/jarsigner-alt -> /etc/alternatives/jarsigner
lrwxrwxrwx  1 root root 25 May 27 10:35 /usr/bin/java -> /usr/java/latest/bin/java
lrwxrwxrwx  1 root root 22 May 27 09:47 /usr/bin/java-alt -> /etc/alternatives/java
lrwxrwxrwx  1 root root 26 May 27 10:38 /usr/bin/javac -> /usr/java/latest/bin/javac
lrwxrwxrwx  1 root root 23 May 27 09:47 /usr/bin/javac-alt -> /etc/alternatives/javac
lrwxrwxrwx  1 root root 25 May 27 09:47 /usr/bin/javadoc -> /etc/alternatives/javadoc
lrwxrwxrwx  1 root root 26 May 28 09:37 /usr/bin/javah -> /usr/java/latest/bin/javah
lrwxrwxrwx  1 root root 23 May 27 09:47 /usr/bin/javah-alt -> /etc/alternatives/javah
lrwxrwxrwx  1 root root 26 May 27 10:39 /usr/bin/javap -> /usr/java/latest/bin/javap
lrwxrwxrwx  1 root root 23 May 27 09:47 /usr/bin/javap-alt -> /etc/alternatives/javap
lrwxrwxrwx  1 root root 27 May 27 10:40 /usr/bin/javaws -> /usr/java/latest/bin/javaws
lrwxrwxrwx. 1 root root 28 May 15 14:56 /usr/bin/javaws-alt -> /usr/java/default/bin/javaws
</code>

If ''which java'' and ''java -version'' return the proper information, reconfigure java for R:

<code>
# at the OS prompt
R CMD javareconf
# in R
install.packages('rJava')
</code>

You could also set java in this file: $HADOOP_HOME/conf/hadoop-env.sh
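For reference, the relevant line in hadoop-env.sh would be the same JAVA_HOME export used in /etc/profile above (the path shown is the one from this setup; adjust it to your Java install):

```shell
# in $HADOOP_HOME/conf/hadoop-env.sh -- point at the same Java used for Hadoop
export JAVA_HOME=/usr/java/latest
```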

When that is successful, add the dependencies.

See the following files for current lists of dependencies:

[[https://github.com/RevolutionAnalytics/rmr2/blob/master/pkg/DESCRIPTION]]
[[https://github.com/RevolutionAnalytics/rhdfs/blob/master/pkg/DESCRIPTION]]

Enter R and issue the command

<code>
install.packages(c("Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2"))
</code>

If Rcpp is a problem, locate and install an older version of Rcpp from the CRAN archives (http://cran.r-project.org/src/contrib/Archive/Rcpp/):

<code>
# in R
install.packages("int64")
# at the OS prompt
wget http://cran.r-project.org/src/contrib/Archive/Rcpp/Rcpp_0.9.8.tar.gz
R CMD INSTALL Rcpp_0.9.8.tar.gz
</code>

Finally, install the RHadoop packages at the OS level:

<code>
wget -O rmr-2.2.0.tar.gz http://goo.gl/bhCU6
wget -O rhdfs_1.0.5.tar.gz https://github.com/RevolutionAnalytics/rhdfs/blob/master/build/rhdfs_1.0.5.tar.gz?raw=true

R CMD INSTALL rmr-2.2.0.tar.gz
R CMD INSTALL rhdfs_1.0.5.tar.gz
</code>

Verify:

<code>
Type 'q()' to quit R.

> library(rmr2)
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
> library(rhdfs)
Loading required package: rJava

HADOOP_CMD=/usr/bin/hadoop

Be sure to run hdfs.init()
> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] rhdfs_1.0.5    rJava_0.9-4    rmr2_2.2.0     reshape2_1.2.2 plyr_1.8
 [6] stringr_0.6.2  functional_0.4 digest_0.6.3   RJSONIO_1.0-3  Rcpp_0.10.3
</code>

Test:

Tutorial documentation: [[https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md]]

R script:

<code>
#!/usr/bin/Rscript

library(rmr2)
library(rhdfs)
hdfs.init()

small.ints = to.dfs(1:1000)
mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
</code>
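The map function above emits each value alongside its square. As a quick sanity check of that arithmetic without involving Hadoop, the same pairs can be produced in the shell:

```shell
# v and v^2 for the first few integers, mirroring the map function's output
seq 1 5 | awk '{print $1, $1*$1}'
# prints "1 1" through "5 25", one pair per line
```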

Then HBase for rhbase:

[[http://hbase.apache.org/book/configuration.html]]

But first Thrift, the language interface to the HBase database:

<code>
yum install openssl098e
</code>

Download Thrift: [[http://thrift.apache.org/download/]]

<code>
yum install byacc -y
yum install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel

./configure
make
make install
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig/
pkg-config --cflags thrift
cp -p /usr/local/lib/libthrift-0.9.0.so /usr/lib/

HBASE_ROOT/bin/hbase thrift start &
# port 9090 is the thrift server, port 9095 the monitor
lsof -i:9090
</code>

Configure for a distributed environment: [[http://hbase.apache.org/book/standalone_dist.html#standalone]]

  * used 3 zookeepers with quorum, see the config example online
  * start with rolling_restart; the start & stop scripts have a timing issue
  * /hbase owned by root:root
  * permissions reset on /hdfs, not sure why
  * also use /sanscratch/zookeepers
  * some more notes below

<code>
install.packages('rJava')
install.packages("int64")
install.packages(c("Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2"))

wget http://cran.r-project.org/src/contrib/Archive/Rcpp/Rcpp_0.9.8.tar.gz
wget -O rmr-2.2.0.tar.gz http://goo.gl/bhCU6
wget -O rhdfs_1.0.5.tar.gz https://github.com/RevolutionAnalytics/rhdfs/blob/master/build/rhdfs_1.0.5.tar.gz?raw=true

R CMD INSTALL Rcpp_0.9.8.tar.gz
R CMD INSTALL rmr-2.2.0.tar.gz
R CMD INSTALL rhdfs_1.0.5.tar.gz
R CMD INSTALL rhbase_1.2.0.tar.gz

yum install openssl098e openssl openssl-devel flex boost ruby ruby-libs ruby-devel php php-libs php-devel \
automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel

b2 install --prefix=/usr/local

thrift: ./configure --prefix=/usr/local --with-boost=/usr/local; make
make install

cp -p /usr/local/lib/libthrift-0.9.0.so /usr/lib/
cd /usr/lib; ln -s libthrift-0.9.0.so libthrift.so

SKIP (nasty, replaced with a straight copy; could go to nodes)
http://www.cpan.org
'o conf commit'
cpan> install Hadoop::Streaming

whitetail only: unpack hbase, edit conf/hbase-site.xml, add to /etc/rc.local
also edit conf/regionservers
copy /usr/local/hbase-version-dir to nodes:/usr/local

  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>example1,example2,example3</value>
    <description>The directory shared by RegionServers.
    </description>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/export/zookeeper</value>
    <description>Property from ZooKeeper's config zoo.cfg.
    The directory where the snapshot is stored.
    </description>
  </property>
</code>

  
==== Perl Hadoop's native Streaming ====
  
==== Perl Hadoop's native Streaming #2 ====

Adopted from [[http://autofei.wordpress.com/category/java/hadoop-code/]]

  * Create vectors X = [x1, x2, ...] and Y = [y1, y2, ...]
  * And solve the product Z = [x1*y1, x2*y2, ...]
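Before running this through Hadoop, the arithmetic itself can be sketched locally (the vectors and file paths below are made up for illustration):

```shell
# two small vectors, one component per line
printf '1\n2\n3\n' > /tmp/x.txt
printf '4\n5\n6\n' > /tmp/y.txt

# Z = [x1*y1, x2*y2, x3*y3]
paste /tmp/x.txt /tmp/y.txt | awk '{print $1 * $2}'
# prints 4, 10, 18 (one per line)
```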

First, do this twice in the shell:
  
<code>
...
</code>
  
Then we'll use this mapper:

<code>
#!/usr/bin/perl

# convert comma delimited to tab delimited

while($line=<STDIN>){
        chomp($line);
        @fields = split(/,/, $line);
        if ($fields[0] eq '#') { next; }
        if($fields[0] && $fields[1]){
                print "$fields[0]\t$fields[1]\n";
        }
}
</code>
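The mapper's job is just comment filtering plus comma-to-tab conversion; a simplified stand-in with standard tools (on made-up sample rows) shows the expected output shape:

```shell
# drop comment lines, turn commas into tabs (roughly what the Perl mapper does)
printf '#x,y\n1,4\n2,5\n' | grep -v '^#' | tr ',' '\t'
```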

And the reducer from the web site (note it keeps a running product per key and emits the key when exactly two values were seen):

<code>
#!/usr/bin/perl

$lastKey="";
$product=1;
$count=0;

while($line=<STDIN>){
        @fields=split(/\t/, $line);
        $key = $fields[0];
        $value = $fields[1];
        if($lastKey ne "" && $key ne $lastKey){
                if($count==2){
                        print "$lastKey\t$product\n";
                }
                $product=$value;
                $lastKey=$key;
                $count=1;
        }
        else{
                $product=$product*$value;
                $lastKey=$key;
                $count++;
        }
}
# the last key
if($count==2){
        print "$lastKey\t$product\n";
}
</code>

And submit the job:

<code>
hadoop jar \
/usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u6.jar \
-input /tmp/v_data.txt -output /tmp/v.out \
-file ~/v_mapper.pl -mapper ~/v_mapper.pl \
-file ~/v_reducer.pl -reducer ~/v_reducer.pl
</code>

And that works.
  
  
==== Perl Hadoop::Streaming ====

  * All nodes
  
  * [[http://search.cpan.org/~spazm/Hadoop-Streaming-0.122420/lib/Hadoop/Streaming.pm]]
cluster/115.1369746245.txt.gz ยท Last modified: 2013/05/28 09:04 by hmeij