cluster:115 [2013/05/28 13:06] hmeij [Perl Hadoop's native Streaming #2]
cluster:115 [2013/09/10 19:04] (current) hmeij [Rhadoop]
**[[cluster:

===== Use Hadoop Cluster =====

[[cluster:
  * a bit involved ... [[http://

Here are my steps to get this working (with lots of Ross's help) for the rmr2 and rhdfs installation. Do this on all nodes.

  * Add the EPEL repository to your yum installation
  * yum install R, which pulls in

<code>
R-core-3.0.0-2.el6.x86_64
R-java-devel-3.0.0-2.el6.x86_64
R-devel-3.0.0-2.el6.x86_64
R-core-devel-3.0.0-2.el6.x86_64
R-java-3.0.0-2.el6.x86_64
R-3.0.0-2.el6.x86_64
</code>

Make sure java is installed properly (the one you used for Hadoop itself) and set ENV in /

<code>
export JAVA_HOME="/
export PATH=/

export HADOOP_HOME=/
export HADOOP_CMD=/
export HADOOP_STREAMING=/
</code>

I noticed that at some point openJDK is reinstalled, so I managed these links

<code>
lrwxrwxrwx
lrwxrwxrwx. 1 root root 28 May 15 14:56 /
</code>

So if commands ''

<code>
# at OS
R CMD javareconf
# in R
install.packages('
</code>

You could also set java in this file: $HADOOP_HOME/

When that is successful, add the dependencies.

See the following files for current lists of dependencies:

[[https://
[[https://

Enter R and issue the command

<code>
install.packages(c("
</code>

If Rcpp is a problem: locate and install an older version of Rcpp in the CRAN archives (http://

<code>
# in R
install.packages("
# at OS
wget http://
R CMD INSTALL Rcpp_0.9.8.tar.gz
</code>

Finally, the RHadoop packages, at the OS level

<code>
wget -O rmr-2.2.0.tar.gz http://
wget -O rhdfs_1.0.5.tar.gz https://

R CMD INSTALL rmr-2.2.0.tar.gz
R CMD INSTALL rhdfs_1.0.5.tar.gz
</code>

Verify

<code>
Type '

> library(rmr2)
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
> library(rhdfs)
Loading required package: rJava

HADOOP_CMD=/

Be sure to run hdfs.init()
> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8
[3] LC_TIME=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8
[7] LC_PAPER=C
[9] LC_ADDRESS=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats

other attached packages:
[1] rhdfs_1.0.5
[6] stringr_0.6.2
</code>

Test

Tutorial documentation:

R script:

<code>
#

library(rmr2)
library(rhdfs)
hdfs.init()

small.ints = to.dfs(1:
mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
</code>
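Before running the job, the arithmetic of the map step can be previewed in a plain shell sketch. The sample range 1..5 below is made up for illustration (the script's actual to.dfs input range is truncated above); each value v becomes the pair (v, v^2).

```shell
# sketch of what the map function computes: each input value v -> (v, v^2)
# the range 1..5 is an illustrative sample, not the job's real input
seq 1 5 | awk '{print $1, $1*$1}'
```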

Then HBase for rhbase:

[[http://

But first Thrift, the language interface to the HBase database:

<code>
yum install openssl098e
</code>

Download Thrift: [[http://

<code>
yum install byacc -y
yum install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel

./configure
make
make install
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/
pkg-config --cflags thrift
cp -p /

HBASE_ROOT/
lsof -i:9090 that is the server; port 9095 is the monitor
</code>

Configure for the distributed environment:

  * used 3 zookeepers with quorum, see config example online
  * start with rolling_restart,
  * /hbase owned by root:root
  * permissions reset on /hdfs, not sure why
  * also use /
  * some more notes below
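A minimal hbase-site.xml sketch for such a distributed setup with a 3-node zookeeper quorum. The hostnames (zk1..zk3) and the data directory are placeholders, not this cluster's actual values; the property names themselves are standard HBase ones.

```xml
<configuration>
  <!-- placeholders: substitute your own quorum hosts and paths -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1,zk2,zk3</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <!-- The directory where the snapshot is stored. -->
    <value>/var/lib/zookeeper</value>
  </property>
</configuration>
```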

<code>
install.packages('
install.packages("
install.packages(c("

wget http://
wget -O rmr-2.2.0.tar.gz http://
wget -O rhdfs_1.0.5.tar.gz https://

R CMD INSTALL Rcpp_0.9.8.tar.gz
R CMD INSTALL rmr-2.2.0.tar.gz
R CMD INSTALL rhdfs_1.0.5.tar.gz
R CMD INSTALL rhbase_1.2.0.tar.gz

yum install openssl098e openssl openssl-devel flex boost ruby ruby-libs ruby-devel php php-libs php-devel \
automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel

b2 install --prefix=/

thrift: ./configure --prefix=/
make install

cp -p /
cd /usr/lib; ln -s libthrift-0.9.0.so libthrift.so

SKIP (nasty, replaced with straight copy, could go to nodes)
http://
'o conf commit'
cpan> install Hadoop::

whitetail only, unpack hbase, edit conf/
also edit conf/
copy /

The directory where the snapshot is stored.
</code>

==== Perl Hadoop's native Streaming #2 ====
Adapted from [[http://
  * Create vectors X = [x1,x2, ...] and Y = [y1,y2, ...]
  * And solve the product Z = [x1*y1, x2*y2, ...]
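Outside Hadoop the product is a one-liner; here is a quick shell sketch with made-up numbers, useful for comparing against the job's output later:

```shell
# Z = [x1*y1, x2*y2, ...]: pair up two one-value-per-line files and multiply
printf '1\n2\n3\n' > /tmp/X.txt
printf '4\n5\n6\n' > /tmp/Y.txt
paste /tmp/X.txt /tmp/Y.txt | awk '{print $1*$2}'
```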
First, do this twice in shell
<code>
</code>
Then we'll use the mapper
<code>
#!/usr/bin/perl

# convert comma delimited to tab delimited

while($line=<STDIN>){
  chomp($line);
  @fields = split(/,/, $line);
  if ($fields[0] eq '#'){ next; }
  if($fields[0] && $fields[1]){
    print "$fields[0]\t$fields[1]\n";
  }
}
</code>
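The mapper can be sanity-checked standalone by piping a few made-up records through the same logic as a one-liner: the line whose first field is # is dropped, and commas become tabs.

```shell
# feed sample comma-delimited records through the mapper logic
printf 'a,1\n#,skip\nb,2\n' | perl -ne '
  chomp; @f = split(/,/);
  next if $f[0] eq "#";
  print "$f[0]\t$f[1]\n" if $f[0] && $f[1];'
```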

And the reducer from the web site

<code>
#!/usr/bin/perl

$lastKey="";
$product=1;
$count=0;

while($line=<STDIN>){
  chomp($line);
  @fields=split(/\t/, $line);
  $key = $fields[0];
  $value = $fields[1];
  if($lastKey ne "" && $key ne $lastKey){
    if($count==2){
      print "$lastKey\t$product\n";
    }
    $product=$value;
    $lastKey=$key;
    $count=1;
  }
  else{
    $product=$product*$value;
    $lastKey=$key;
    $count++;
  }
}
# the last key
if($count==2){
  print "$lastKey\t$product\n";
}
</code>

And submit the job

<code>
/
-input /
-file ~/
-file ~/
</code>

And that works.
==== Perl Hadoop:: ====

  * All nodes

  * [[http://