cluster:115 [2013/09/10 19:04] (current) – [Rhadoop] hmeij

**[[cluster:

===== Use Hadoop Cluster =====

[[cluster:

  * a bit involved ... [[http://
| + | |||
| + | Here are my steps to get this working (with lots of Ross's help) for rmr2 and rhdfs installation. Do this on all nodes. | ||
| + | |||
| + | * Add EPEL repository to your yum installation, | ||
| + | * yum install R, which pulls in | ||
| + | |||
| + | < | ||
| + | R-core-3.0.0-2.el6.x86_64 | ||
| + | R-java-devel-3.0.0-2.el6.x86_64 | ||
| + | R-devel-3.0.0-2.el6.x86_64 | ||
| + | R-core-devel-3.0.0-2.el6.x86_64 | ||
| + | R-java-3.0.0-2.el6.x86_64 | ||
| + | R-3.0.0-2.el6.x86_64 | ||
| + | </ | ||
| + | |||
| + | |||
| + | Make sure java is installed properly (the one you used for Hadoop itself) and set ENV in / | ||
| + | |||
| + | < | ||
| + | export JAVA_HOME="/ | ||
| + | export PATH=/ | ||
| + | |||
| + | export HADOOP_HOME=/ | ||
| + | export HADOOP_CMD=/ | ||
| + | export HADOOP_STREAMING=/ | ||
| + | </ | ||
| + | |||
| + | I noticed that at soome point openJDK is reinstalled so I managed these links | ||
| + | |||
| + | < | ||
| + | lrwxrwxrwx | ||
| + | lrwxrwxrwx | ||
| + | lrwxrwxrwx | ||
| + | lrwxrwxrwx | ||
| + | lrwxrwxrwx | ||
| + | lrwxrwxrwx | ||
| + | lrwxrwxrwx | ||
| + | lrwxrwxrwx | ||
| + | lrwxrwxrwx | ||
| + | lrwxrwxrwx | ||
| + | lrwxrwxrwx | ||
| + | lrwxrwxrwx | ||
| + | lrwxrwxrwx | ||
| + | lrwxrwxrwx | ||
| + | lrwxrwxrwx. 1 root root 28 May 15 14:56 / | ||
| + | </ | ||
| + | |||
| + | So if commands '' | ||
| + | |||
| + | < | ||
| + | # at OS | ||
| + | R CMD javareconf | ||
| + | # in R | ||
| + | install.packages(' | ||
| + | </ | ||
| + | |||
| + | You could also set java in this file: $HADOOP_HOME/ | ||
| + | |||
| + | When that successful, add dependencies: | ||
| + | |||
| + | See the following files for current lists of dependencies: | ||
| + | |||
| + | [[https:// | ||
| + | [[https:// | ||
| + | |||
| + | Enter R and issues the command | ||
| + | |||
| + | < | ||
| + | install.packages(c(" | ||
| + | </ | ||
| + | |||
| + | If Rccp is a problem: locate and install an older version of Rcpp in the CRAN archives (http:// | ||
| + | |||
| + | < | ||
| + | # in R | ||
| + | install.packages(" | ||
| + | # at OS | ||
| + | wget http:// | ||
| + | R CMD INSTALL Rcpp_0.9.8.tar.gz | ||
| + | </ | ||
| + | |||
| + | Finally the RHadoop stuff, at The OS level | ||
| + | |||
| + | < | ||
| + | wget -O rmr-2.2.0.tar.gz http:// | ||
| + | wget -O rhdfs_1.0.5.tar.gz https:// | ||
| + | |||
| + | R CMD INSTALL rmr-2.2.0.tar.gz | ||
| + | R CMD INSTALL rhdfs_1.0.5.tar.gz | ||
| + | </ | ||
| + | |||
| + | Verify | ||
| + | |||
| + | < | ||
| + | Type ' | ||
| + | |||
| + | > library(rmr2) | ||
| + | Loading required package: Rcpp | ||
| + | Loading required package: RJSONIO | ||
| + | Loading required package: digest | ||
| + | Loading required package: functional | ||
| + | Loading required package: stringr | ||
| + | Loading required package: plyr | ||
| + | Loading required package: reshape2 | ||
| + | > library(rhdfs) | ||
| + | Loading required package: rJava | ||
| + | |||
| + | HADOOP_CMD=/ | ||
| + | |||
| + | Be sure to run hdfs.init() | ||
| + | > sessionInfo() | ||
| + | R version 3.0.0 (2013-04-03) | ||
| + | Platform: x86_64-redhat-linux-gnu (64-bit) | ||
| + | |||
| + | locale: | ||
| + | [1] LC_CTYPE=en_US.UTF-8 | ||
| + | [3] LC_TIME=en_US.UTF-8 | ||
| + | [5] LC_MONETARY=en_US.UTF-8 | ||
| + | [7] LC_PAPER=C | ||
| + | [9] LC_ADDRESS=C | ||
| + | [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C | ||
| + | |||
| + | attached base packages: | ||
| + | [1] stats | ||
| + | |||
| + | other attached packages: | ||
| + | [1] rhdfs_1.0.5 | ||
| + | [6] stringr_0.6.2 | ||
| + | |||
| + | </ | ||
| + | |||
| + | Test | ||
| + | |||
| + | Tutorial documentation: | ||
| + | |||
| + | R script: | ||
| + | |||
| + | < | ||
| + | # | ||
| + | |||
| + | library(rmr2) | ||
| + | library(rhdfs) | ||
| + | hdfs.init() | ||
| + | |||
| + | small.ints = to.dfs(1: | ||
| + | mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2)) | ||
| + | </ | ||
| + | |||
| + | Then Hbase for Rhbase: | ||
| + | |||
| + | [[http:// | ||
| + | |||
| + | But first Trift, the language interface to the database Hbase: | ||
| + | |||
| + | < | ||
| + | yum install openssl098e | ||
| + | </ | ||
| + | |||
| + | Download Trift: [[http:// | ||
| + | |||
| + | < | ||
| + | yum install byacc -y | ||
| + | yum install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel | ||
| + | |||
| + | ./configure | ||
| + | make | ||
| + | make install | ||
| + | export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/ | ||
| + | pkg-config --cflags thrift | ||
| + | cp -p / | ||
| + | |||
| + | HBASE_ROOT/ | ||
| + | lsof -i:9090 that is server, port 9095 is monitor | ||
| + | |||
| + | </ | ||
| + | |||
| + | Configure for distributed environment: | ||
| + | |||
| + | * used 3 zookeepers with quorum, see config example online | ||
| + | * start with rolling_restart, | ||
| + | * /hbase owened by root:root | ||
| + | * permissions reset on /hdfs, not sure why | ||
| + | * also use / | ||
| + | * some more notes below | ||
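Following the ''lsof -i:9090'' note above, a quick port probe can confirm the Thrift server (9090) and its monitor (9095) are listening. This bash sketch assumes localhost; on a distributed setup you would probe the actual server host.

```shell
#!/bin/bash
# Sketch: check whether the HBase Thrift server (9090) and its
# monitor (9095) are listening. "localhost" is an assumption.
check_port() {
  host=$1; port=$2
  # The /dev/tcp redirection is a bash feature; the subshell
  # closes the connection again as soon as it exits.
  if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
    echo "$host:$port open"
  else
    echo "$host:$port closed"
  fi
}
check_port localhost 9090
check_port localhost 9095
```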
| + | |||
| + | |||
| + | < | ||
| + | |||
| + | |||
| + | install.packages(' | ||
| + | install.packages(" | ||
| + | install.packages(c(" | ||
| + | |||
| + | wget http:// | ||
| + | wget -O rmr-2.2.0.tar.gz http:// | ||
| + | wget -O rhdfs_1.0.5.tar.gz https:// | ||
| + | |||
| + | R CMD INSTALL Rcpp_0.9.8.tar.gz | ||
| + | R CMD INSTALL rmr-2.2.0.tar.gz | ||
| + | R CMD INSTALL rhdfs_1.0.5.tar.gz | ||
| + | R CMD INSTALL rhbase_1.2.0.tar.gz | ||
| + | |||
| + | yum install openssl098e openssl openssl-devel flex boost ruby ruby-libs ruby-devel php php-libs php-devel \ | ||
| + | automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel | ||
| + | |||
| + | b2 install --prefix=/ | ||
| + | |||
| + | thrift: ./configure --prefix=/ | ||
| + | make install | ||
| + | |||
| + | cp -p / | ||
| + | cd /usr/lib; ln -s libthrift-0.9.0.so libthrift.so | ||
| + | |||
| + | SKIP (nasty replaced with straight copy, could go to nodes) | ||
| + | http:// | ||
| + | 'o conf commit' | ||
| + | cpan> install Hadoop:: | ||
| + | |||
| + | whitetail only, unpack hbase, edit conf/ | ||
| + | also edit conf/ | ||
| + | copy / | ||
| + | |||
| + | < | ||
| + | < | ||
| + | < | ||
| + | < | ||
| + | </ | ||
| + | </ | ||
| + | < | ||
| + | < | ||
| + | < | ||
| + | < | ||
| + | The directory where the snapshot is stored. | ||
| + | </ | ||
| + | </ | ||
| + | |||
| + | |||
| + | </ | ||
| + | |||
| + | |||
| + | |||
| ==== Perl Hadoop' | ==== Perl Hadoop' | ||
| Line 296: | Line 537: | ||
| </ | </ | ||
| + | ==== Perl Hadoop' | ||
| + | Adopted from [[http:// | ||
| + | |||
| + | * Create vectors X = [x1,x2, ...] and Y = [y1,y2, ...] | ||
| + | * And solve the product Z = [x1*y1, x2*y2, ...] | ||
| + | |||
| + | First, do this twice in shell | ||
| + | |||
| + | < | ||
| + | for i in `seq 1 1000000` | ||
| + | > do | ||
| + | > echo -e " | ||
| + | > done | ||
| + | </ | ||
| + | |||
| + | The we'll use the mapper | ||
| + | |||
| + | < | ||
| + | # | ||
| + | |||
| + | # convert comma delimited to tab delimited | ||
| + | |||
| + | while($line=< | ||
| + | @fields = split(/,/, $line); | ||
| + | if ($fields[0] eq '#' | ||
| + | if($fields[0] && $fields[1]){ | ||
| + | print " | ||
| + | } | ||
| + | } | ||
| + | </ | ||
| + | |||
| + | And the reducer from the web site | ||
| + | |||
| + | < | ||
| + | # | ||
| + | |||
| + | $lastKey=""; | ||
| + | $product=1; | ||
| + | $count=0; | ||
| + | |||
| + | while($line=< | ||
| + | @fields=split(/ | ||
| + | $key = $fields[0]; | ||
| + | $value = $fields[1]; | ||
| + | if($lastKey ne "" | ||
| + | if($count==2){ | ||
| + | print " | ||
| + | } | ||
| + | $product=$value; | ||
| + | $lastKey=$key; | ||
| + | $count=1; | ||
| + | } | ||
| + | else{ | ||
| + | $product=$product*$value; | ||
| + | $lastKey=$key; | ||
| + | $count++; | ||
| + | } | ||
| + | } | ||
| + | #the last key | ||
| + | if($count==2){ | ||
| + | print " | ||
| + | </ | ||
| + | |||
| + | And submit the job | ||
| + | |||
| + | < | ||
| + | | ||
| + | / | ||
| + | -input / | ||
| + | -file ~/ | ||
| + | -file ~/ | ||
| + | </ | ||
| + | |||
| + | And that works. | ||
==== Perl Hadoop:: ====

  * All nodes

  * [[http://

<code>
cpan> install Hadoop::

Installing /
Installing /
Installing /
</code>

  * How to use this?
cluster/115.1369491274.txt.gz · Last modified: by hmeij