  * a bit involved ... [[http://bighadoop.wordpress.com/2013/02/25/r-and-hadoop-data-analysis-rhadoop/]]
  
Here are my steps to get this working (with lots of Ross's help) for the rmr2 and rhdfs installation. Do this on all nodes.

  * Add the EPEL repository to your yum installation (a sketch follows the package list below), then
  * yum install R, which pulls in

<code>
R-core-3.0.0-2.el6.x86_64
R-java-devel-3.0.0-2.el6.x86_64
R-devel-3.0.0-2.el6.x86_64
R-core-devel-3.0.0-2.el6.x86_64
R-java-3.0.0-2.el6.x86_64
R-3.0.0-2.el6.x86_64
</code>
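
Since the R packages above come from EPEL, the first bullet typically amounts to installing the epel-release RPM; the exact mirror URL and RPM version below are an assumption for CentOS/RHEL 6:

<code>
# install the EPEL 6 release RPM (mirror URL and version may differ)
rpm -Uvh http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
yum install R
</code>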

Make sure Java is installed properly (the one you used for Hadoop itself) and set the environment in /etc/profile

<code>
export JAVA_HOME="/usr/java/latest"
export PATH=/usr/java/latest/bin:$PATH

export HADOOP_HOME=/usr/lib/hadoop-0.20
export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_STREAMING=/usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u6.jar
</code>

I noticed that at some point openJDK gets reinstalled, so I manage these links

<code>
lrwxrwxrwx  1 root root 24 May 27 10:41 /usr/bin/jar -> /usr/java/latest/bin/jar
lrwxrwxrwx  1 root root 21 May 27 09:47 /usr/bin/jar-alt -> /etc/alternatives/jar
lrwxrwxrwx  1 root root 30 May 27 10:41 /usr/bin/jarsigner -> /usr/java/latest/bin/jarsigner
lrwxrwxrwx  1 root root 27 May 27 09:47 /usr/bin/jarsigner-alt -> /etc/alternatives/jarsigner
lrwxrwxrwx  1 root root 25 May 27 10:35 /usr/bin/java -> /usr/java/latest/bin/java
lrwxrwxrwx  1 root root 22 May 27 09:47 /usr/bin/java-alt -> /etc/alternatives/java
lrwxrwxrwx  1 root root 26 May 27 10:38 /usr/bin/javac -> /usr/java/latest/bin/javac
lrwxrwxrwx  1 root root 23 May 27 09:47 /usr/bin/javac-alt -> /etc/alternatives/javac
lrwxrwxrwx  1 root root 25 May 27 09:47 /usr/bin/javadoc -> /etc/alternatives/javadoc
lrwxrwxrwx  1 root root 26 May 28 09:37 /usr/bin/javah -> /usr/java/latest/bin/javah
lrwxrwxrwx  1 root root 23 May 27 09:47 /usr/bin/javah-alt -> /etc/alternatives/javah
lrwxrwxrwx  1 root root 26 May 27 10:39 /usr/bin/javap -> /usr/java/latest/bin/javap
lrwxrwxrwx  1 root root 23 May 27 09:47 /usr/bin/javap-alt -> /etc/alternatives/javap
lrwxrwxrwx  1 root root 27 May 27 10:40 /usr/bin/javaws -> /usr/java/latest/bin/javaws
lrwxrwxrwx. 1 root root 28 May 15 14:56 /usr/bin/javaws-alt -> /usr/java/default/bin/javaws
</code>

If ''which java'' and ''java -version'' return the proper information, reconfigure Java support in R. At the OS prompt:

<code>
# at OS
R CMD javareconf
# in R
install.packages('rJava')
</code>

You could also set Java in this file: $HADOOP_HOME/conf/hadoop-env.sh
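
A minimal sketch of that alternative (assuming the CDH3 paths used above):

<code>
# in $HADOOP_HOME/conf/hadoop-env.sh
export JAVA_HOME=/usr/java/latest
</code>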

When that is successful, add the dependencies.

See the following files for the current lists of dependencies:

[[https://github.com/RevolutionAnalytics/rmr2/blob/master/pkg/DESCRIPTION]]
[[https://github.com/RevolutionAnalytics/rhdfs/blob/master/pkg/DESCRIPTION]]

Enter R and issue the command

<code>
install.packages(c("Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2"))
</code>

If Rcpp is a problem, locate and install an older version of Rcpp from the CRAN archives (http://cran.r-project.org/src/contrib/Archive/Rcpp/):

<code>
# in R
install.packages("int64")
# at OS
wget http://cran.r-project.org/src/contrib/Archive/Rcpp/Rcpp_0.9.8.tar.gz
R CMD INSTALL Rcpp_0.9.8.tar.gz
</code>

Finally, install the RHadoop packages themselves at the OS level

<code>
wget -O rmr-2.2.0.tar.gz http://goo.gl/bhCU6
wget -O rhdfs_1.0.5.tar.gz https://github.com/RevolutionAnalytics/rhdfs/blob/master/build/rhdfs_1.0.5.tar.gz?raw=true

R CMD INSTALL rmr-2.2.0.tar.gz
R CMD INSTALL rhdfs_1.0.5.tar.gz
</code>

Verify

<code>
Type 'q()' to quit R.

> library(rmr2)
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
> library(rhdfs)
Loading required package: rJava

HADOOP_CMD=/usr/bin/hadoop

Be sure to run hdfs.init()
> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] rhdfs_1.0.5    rJava_0.9-4    rmr2_2.2.0     reshape2_1.2.2 plyr_1.8
 [6] stringr_0.6.2  functional_0.4 digest_0.6.3   RJSONIO_1.0-3  Rcpp_0.10.3
</code>

Test

Tutorial documentation: [[https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md]]

R script:

<code>
#!/usr/bin/Rscript

library(rmr2)
library(rhdfs)
hdfs.init()

small.ints = to.dfs(1:1000)
mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
</code>
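
To pull the job's output back out of HDFS, wrap the call in ''from.dfs()''; a minimal sketch (the variable names here are mine):

<code>
# from.dfs() returns a list with $key and $val components
out <- from.dfs(mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2)))
head(out$val)   # rows of v and v^2
</code>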

Then HBase for rhbase:

[[http://hbase.apache.org/book/configuration.html]]

But first Thrift, the language interface to the HBase database:

<code>
yum install openssl098e
</code>

Download Thrift: [[http://thrift.apache.org/download/]]

<code>
yum install byacc -y
yum install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel

./configure
make
make install
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig/
pkg-config --cflags thrift
cp -p /usr/local/lib/libthrift-0.9.0.so /usr/lib/

HBASE_ROOT/bin/hbase thrift start &
lsof -i:9090   # 9090 is the Thrift server; port 9095 is the monitor
</code>

Configure for a distributed environment: [[http://hbase.apache.org/book/standalone_dist.html#standalone]]

  * used 3 zookeepers with quorum, see the config example online and the sketch below
  * start with rolling_restart; the plain start & stop scripts have a timing issue
  * /hbase owned by root:root
  * permissions reset on /hdfs, not sure why
  * also use /sanscratch/zookeepers
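
A minimal ''hbase-site.xml'' sketch of that three-node quorum; the hostnames are hypothetical and the rootdir URI must match your HDFS namenode:

<code>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode:8020/hbase</value>  <!-- hypothetical namenode -->
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>node1,node2,node3</value>  <!-- hypothetical hosts -->
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/sanscratch/zookeepers</value>
  </property>
</configuration>
</code>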
==== Perl Hadoop's native Streaming ====

First we create a script for the map step; note that it only prints the individual words back

<code>

vi wc_mapper.pl  # make user executable and add contents below

#!/usr/bin/perl -w

# read in a file of text, print word by word

while(<>) {

        @line = split;
        foreach $word (@line) {
                print "$word\n";
        }
}

</code>

Next we create a script for the reduce step; note that the use of a hash avoids an explicit sort

<code>

vi wc_reducer.pl  # make user executable and add contents below

#!/usr/bin/perl -w

# store words in hash and print key-value results

while (<>) {
        chomp;
        $seen{$_}++;
}

foreach $key (keys %seen) {
        print "$seen{$key} $key\n";
}

</code>

Next, test this on the command line

<code>

perl wc_mapper.pl HF.txt | perl wc_reducer.pl | sort -rn | head

6050 and
4708 the
2935 a
2903 to
2475 I
1942 was
1733 of
1427 it
1372 he
1367 in

</code>

Load the text input file up to Hadoop's HDFS and submit the job

<code>

hadoop fs -put HF.txt /tmp
hadoop dfs -ls /tmp

# results
Found 2 items
-rw-r--r--   3 hmeij07   supergroup     459378 2013-05-23 14:24 /tmp/DS.txt.gz
-rw-r--r--   3 hmeij07   supergroup     610155 2013-05-23 14:24 /tmp/HF.txt

# submit, note that -mapper and -reducer options are paired with a -file option
# pointing to files in our home directory (not HDFS). these files will be copied
# to each datanode for execution, and then the results will be tabulated

hadoop jar \
/usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u6.jar \
-input /tmp/HF.txt -output /tmp/HF.out \
-file ~/wc_mapper.pl -mapper ~/wc_mapper.pl \
-file ~/wc_reducer.pl -reducer ~/wc_reducer.pl

hadoop fs -ls /tmp/HF.out
# results
Found 3 items
-rw-r--r--   3 hmeij07 supergroup          0 2013-05-24 15:10 /tmp/HF.out/_SUCCESS
drwxrwxrwt   - hmeij07 supergroup          0 2013-05-24 15:10 /tmp/HF.out/_logs
-rw-r--r--   3 hmeij07 supergroup     161788 2013-05-24 15:10 /tmp/HF.out/part-00000


hadoop fs -cat /tmp/HF.out/part-00000 | sort -rn | head
# results
6050 and
4708 the
2935 a
2903 to
2475 I
1942 was
1733 of
1427 it
1372 he
1367 in

# clean up space
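# e.g. (assumption: -rmr is the 0.20-era syntax for a recursive delete)
hadoop fs -rmr /tmp/HF.out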
</code>

==== Perl Hadoop's native Streaming #2 ====

Adapted from [[http://autofei.wordpress.com/category/java/hadoop-code/]]

  * Create vectors X = [x1,x2, ...] and Y = [y1,y2, ...]
  * And solve the product Z = [x1*y1, x2*y2, ...]

First, run this loop twice in the shell, so that each key ends up with two values

<code>
for i in `seq 1 1000000`
> do
> echo -e "$i,$RANDOM" >> v_data_large.txt
> done
</code>

Then we'll use the mapper

<code>
#!/usr/bin/perl

# convert comma delimited to tab delimited

while($line=<STDIN>){
        @fields = split(/,/, $line);
        if ($fields[0] eq '#') { next;}
        if($fields[0] && $fields[1]){
                print "$fields[0]\t$fields[1]";
        }
}
</code>

And the reducer from the web site

<code>
#!/usr/bin/perl

# streaming sorts by key, so the two values for a key arrive on
# consecutive lines; multiply them and print the product per key

$lastKey="";
$product=1;
$count=0;

while($line=<STDIN>){
        @fields=split(/\t/, $line);
        $key = $fields[0];
        $value = $fields[1];
        if($lastKey ne "" && $key ne $lastKey){
                if($count==2){
                        print "$lastKey\t$product\n";
                }
                $product=$value;
                $lastKey=$key;
                $count=1;
        }
        else{
                $product=$product*$value;
                $lastKey=$key;
                $count++;
        }
}
# the last key
if($count==2){
        print "$lastKey\t$product\n";
}
</code>

And submit the job (after putting the data file into HDFS as /tmp/v_data.txt, as before)

<code>
hadoop jar \
/usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u6.jar \
-input /tmp/v_data.txt -output /tmp/v.out \
-file ~/v_mapper.pl -mapper ~/v_mapper.pl \
-file ~/v_reducer.pl -reducer ~/v_reducer.pl
</code>

And that works.
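
To spot-check the output (the path follows from the job above):

<code>
hadoop fs -cat /tmp/v.out/part-00000 | head
# each line should read: key <tab> x*y
</code>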

==== Perl Hadoop::Streaming ====

  * All nodes

  * [[http://search.cpan.org/~spazm/Hadoop-Streaming-0.122420/lib/Hadoop/Streaming.pm]]

<code>

yum install cpan
cpan> install Hadoop::Streaming


Installing /usr/local/share/perl5/Hadoop/Streaming.pm
Installing /usr/local/share/perl5/Hadoop/Streaming/Mapper.pm
Installing /usr/local/share/perl5/Hadoop/Streaming/Reducer.pm
Installing /usr/local/share/perl5/Hadoop/Streaming/Combiner.pm
Installing /usr/local/share/perl5/Hadoop/Streaming/Role/Iterator.pm
Installing /usr/local/share/perl5/Hadoop/Streaming/Role/Emitter.pm
Installing /usr/local/share/perl5/Hadoop/Streaming/Reducer/Input.pm
Installing /usr/local/share/perl5/Hadoop/Streaming/Reducer/Input/Iterator.pm
Installing /usr/local/share/perl5/Hadoop/Streaming/Reducer/Input/ValuesIterator.pm

</code>

  * How to use this? A word-count sketch adapted from the module's documentation follows.
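
This is a minimal mapper/reducer pair in the Hadoop::Streaming style, following the module's POD; treat it as an untested sketch (the package and file names are mine):

<code>
#!/usr/bin/perl
# hs_wc_mapper.pl -- emit each word with a count of 1
package WordCount::Mapper;
use Moose;
with 'Hadoop::Streaming::Mapper';

sub map {
    my ($self, $line) = @_;
    $self->emit($_ => 1) for split /\s+/, $line;
}

package main;
WordCount::Mapper->run;
</code>

<code>
#!/usr/bin/perl
# hs_wc_reducer.pl -- sum the counts for each word
package WordCount::Reducer;
use Moose;
with 'Hadoop::Streaming::Reducer';

sub reduce {
    my ($self, $key, $values) = @_;
    my $count = 0;
    while ($values->has_next) {
        $count += $values->next;
    }
    $self->emit($key => $count);
}

package main;
WordCount::Reducer->run;
</code>

These scripts drop into the same ''hadoop jar ... -mapper/-reducer'' invocation used in the native streaming sections above.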
  
==== MySQL ====