  * a bit involved ... [[http://bighadoop.wordpress.com/2013/02/25/r-and-hadoop-data-analysis-rhadoop/]]
  
Here are my steps to get this working (with lots of Ross's help) for the rmr2 and rhdfs installation. Do this on all nodes.

  * Add the EPEL repository to your yum installation (a sketch follows the package list below), then
  * yum install R, which pulls in

<code>
R-core-3.0.0-2.el6.x86_64
R-java-devel-3.0.0-2.el6.x86_64
R-devel-3.0.0-2.el6.x86_64
R-core-devel-3.0.0-2.el6.x86_64
R-java-3.0.0-2.el6.x86_64
R-3.0.0-2.el6.x86_64
</code>
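
Since the R packages above come from EPEL, the first bullet typically amounts to installing the epel-release RPM; the exact mirror URL and RPM version below are an assumption for CentOS/RHEL 6:

<code>
# install the EPEL 6 release RPM (mirror URL and version may differ)
rpm -Uvh http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
yum install R
</code>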

Make sure Java is installed properly (the one you used for Hadoop itself) and set the environment in /etc/profile

<code>
export JAVA_HOME="/usr/java/latest"
export PATH=/usr/java/latest/bin:$PATH

export HADOOP_HOME=/usr/lib/hadoop-0.20
export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_STREAMING=/usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u6.jar
</code>

I noticed that at some point openJDK gets reinstalled, so I manage these links

<code>
lrwxrwxrwx  1 root root 24 May 27 10:41 /usr/bin/jar -> /usr/java/latest/bin/jar
lrwxrwxrwx  1 root root 21 May 27 09:47 /usr/bin/jar-alt -> /etc/alternatives/jar
lrwxrwxrwx  1 root root 30 May 27 10:41 /usr/bin/jarsigner -> /usr/java/latest/bin/jarsigner
lrwxrwxrwx  1 root root 27 May 27 09:47 /usr/bin/jarsigner-alt -> /etc/alternatives/jarsigner
lrwxrwxrwx  1 root root 25 May 27 10:35 /usr/bin/java -> /usr/java/latest/bin/java
lrwxrwxrwx  1 root root 22 May 27 09:47 /usr/bin/java-alt -> /etc/alternatives/java
lrwxrwxrwx  1 root root 26 May 27 10:38 /usr/bin/javac -> /usr/java/latest/bin/javac
lrwxrwxrwx  1 root root 23 May 27 09:47 /usr/bin/javac-alt -> /etc/alternatives/javac
lrwxrwxrwx  1 root root 25 May 27 09:47 /usr/bin/javadoc -> /etc/alternatives/javadoc
lrwxrwxrwx  1 root root 26 May 28 09:37 /usr/bin/javah -> /usr/java/latest/bin/javah
lrwxrwxrwx  1 root root 23 May 27 09:47 /usr/bin/javah-alt -> /etc/alternatives/javah
lrwxrwxrwx  1 root root 26 May 27 10:39 /usr/bin/javap -> /usr/java/latest/bin/javap
lrwxrwxrwx  1 root root 23 May 27 09:47 /usr/bin/javap-alt -> /etc/alternatives/javap
lrwxrwxrwx  1 root root 27 May 27 10:40 /usr/bin/javaws -> /usr/java/latest/bin/javaws
lrwxrwxrwx. 1 root root 28 May 15 14:56 /usr/bin/javaws-alt -> /usr/java/default/bin/javaws
</code>

If ''which java'' and ''java -version'' return the proper information, reconfigure Java support in R. At the OS prompt:

<code>
# at OS
R CMD javareconf
# in R
install.packages('rJava')
</code>

You could also set Java in this file: $HADOOP_HOME/conf/hadoop-env.sh
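
A minimal sketch of that alternative (assuming the CDH3 paths used above):

<code>
# in $HADOOP_HOME/conf/hadoop-env.sh
export JAVA_HOME=/usr/java/latest
</code>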

When that is successful, add the dependencies.

See the following files for the current lists of dependencies:

[[https://github.com/RevolutionAnalytics/rmr2/blob/master/pkg/DESCRIPTION]]
[[https://github.com/RevolutionAnalytics/rhdfs/blob/master/pkg/DESCRIPTION]]

Enter R and issue the command

<code>
install.packages(c("Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2"))
</code>

If Rcpp is a problem, locate and install an older version of Rcpp from the CRAN archives (http://cran.r-project.org/src/contrib/Archive/Rcpp/):

<code>
# in R
install.packages("int64")
# at OS
wget http://cran.r-project.org/src/contrib/Archive/Rcpp/Rcpp_0.9.8.tar.gz
R CMD INSTALL Rcpp_0.9.8.tar.gz
</code>

Finally, install the RHadoop packages themselves at the OS level

<code>
wget -O rmr-2.2.0.tar.gz http://goo.gl/bhCU6
wget -O rhdfs_1.0.5.tar.gz https://github.com/RevolutionAnalytics/rhdfs/blob/master/build/rhdfs_1.0.5.tar.gz?raw=true

R CMD INSTALL rmr-2.2.0.tar.gz
R CMD INSTALL rhdfs_1.0.5.tar.gz
</code>

Verify

<code>
Type 'q()' to quit R.

> library(rmr2)
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
> library(rhdfs)
Loading required package: rJava

HADOOP_CMD=/usr/bin/hadoop

Be sure to run hdfs.init()
> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] rhdfs_1.0.5    rJava_0.9-4    rmr2_2.2.0     reshape2_1.2.2 plyr_1.8
 [6] stringr_0.6.2  functional_0.4 digest_0.6.3   RJSONIO_1.0-3  Rcpp_0.10.3
</code>

Test

Tutorial documentation: [[https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md]]

R script:

<code>
#!/usr/bin/Rscript

library(rmr2)
library(rhdfs)
hdfs.init()

small.ints = to.dfs(1:1000)
mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
</code>
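
To pull the job's output back out of HDFS, wrap the call in ''from.dfs()''; a minimal sketch (the variable names here are mine):

<code>
# from.dfs() returns a list with $key and $val components
out <- from.dfs(mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2)))
head(out$val)   # rows of v and v^2
</code>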

Then HBase for rhbase:

[[http://hbase.apache.org/book/configuration.html]]

But first Thrift, the language interface to the HBase database:

<code>
yum install openssl098e
</code>

Download Thrift: [[http://thrift.apache.org/download/]]

<code>
yum install byacc -y
yum install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel

./configure
make
make install
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig/
pkg-config --cflags thrift
cp -p /usr/local/lib/libthrift-0.9.0.so /usr/lib/

HBASE_ROOT/bin/hbase thrift start &
lsof -i:9090   # 9090 is the Thrift server; port 9095 is the monitor
</code>

Configure for a distributed environment: [[http://hbase.apache.org/book/standalone_dist.html#standalone]]

  * used 3 zookeepers with quorum, see the config example online and the sketch below
  * start with rolling_restart; the plain start & stop scripts have a timing issue
  * /hbase owned by root:root
  * permissions reset on /hdfs, not sure why
  * also use /sanscratch/zookeepers
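
A minimal ''hbase-site.xml'' sketch of that three-node quorum; the hostnames are hypothetical and the rootdir URI must match your HDFS namenode:

<code>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode:8020/hbase</value>  <!-- hypothetical namenode -->
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>node1,node2,node3</value>  <!-- hypothetical hosts -->
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/sanscratch/zookeepers</value>
  </property>
</configuration>
</code>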
==== Perl Hadoop's native Streaming ====

First we create a script for the map step; note that it only prints the individual words back

<code>

vi wc_mapper.pl  # make user executable and add contents below

#!/usr/bin/perl -w

# read in a file of text, print word by word

while(<>) {

        @line = split;
        foreach $word (@line) {
                print "$word\n";
        }
}

</code>

Next we create a script for the reduce step; note that the use of a hash avoids an explicit sort

<code>

vi wc_reducer.pl  # make user executable and add contents below

#!/usr/bin/perl -w

# store words in hash and print key-value results

while (<>) {
        chomp;
        $seen{$_}++;
}

foreach $key (keys %seen) {
        print "$seen{$key} $key\n";
}

</code>

Next, test this on the command line

<code>

perl wc_mapper.pl HF.txt | perl wc_reducer.pl | sort -rn | head

6050 and
4708 the
2935 a
2903 to
2475 I
1942 was
1733 of
1427 it
1372 he
1367 in

</code>

Load the text input file up to Hadoop's HDFS and submit the job

<code>

hadoop fs -put HF.txt /tmp
hadoop dfs -ls /tmp

# results
Found 2 items
-rw-r--r--   3 hmeij07   supergroup     459378 2013-05-23 14:24 /tmp/DS.txt.gz
-rw-r--r--   3 hmeij07   supergroup     610155 2013-05-23 14:24 /tmp/HF.txt

# submit, note that -mapper and -reducer options are paired with a -file option
# pointing to files in our home directory (not HDFS). these files will be copied
# to each datanode for execution, and then the results will be tabulated

hadoop jar \
/usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u6.jar \
-input /tmp/HF.txt -output /tmp/HF.out \
-file ~/wc_mapper.pl -mapper ~/wc_mapper.pl \
-file ~/wc_reducer.pl -reducer ~/wc_reducer.pl

hadoop fs -ls /tmp/HF.out
# results
Found 3 items
-rw-r--r--   3 hmeij07 supergroup          0 2013-05-24 15:10 /tmp/HF.out/_SUCCESS
drwxrwxrwt   - hmeij07 supergroup          0 2013-05-24 15:10 /tmp/HF.out/_logs
-rw-r--r--   3 hmeij07 supergroup     161788 2013-05-24 15:10 /tmp/HF.out/part-00000


hadoop fs -cat /tmp/HF.out/part-00000 | sort -rn | head
# results
6050 and
4708 the
2935 a
2903 to
2475 I
1942 was
1733 of
1427 it
1372 he
1367 in

# clean up space
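# e.g. (assumption: -rmr is the 0.20-era syntax for a recursive delete)
hadoop fs -rmr /tmp/HF.out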
</code>

==== Perl Hadoop's native Streaming #2 ====

Adapted from [[http://autofei.wordpress.com/category/java/hadoop-code/]]

  * Create vectors X = [x1,x2, ...] and Y = [y1,y2, ...]
  * And solve the product Z = [x1*y1, x2*y2, ...]

First, run this loop twice in the shell, so that each key ends up with two values

<code>
for i in `seq 1 1000000`
> do
> echo -e "$i,$RANDOM" >> v_data_large.txt
> done
</code>

Then we'll use the mapper

<code>
#!/usr/bin/perl

# convert comma delimited to tab delimited

while($line=<STDIN>){
        @fields = split(/,/, $line);
        if ($fields[0] eq '#') { next;}
        if($fields[0] && $fields[1]){
                print "$fields[0]\t$fields[1]";
        }
}
</code>

And the reducer from the web site

<code>
#!/usr/bin/perl

# streaming sorts by key, so the two values for a key arrive on
# consecutive lines; multiply them and print the product per key

$lastKey="";
$product=1;
$count=0;

while($line=<STDIN>){
        @fields=split(/\t/, $line);
        $key = $fields[0];
        $value = $fields[1];
        if($lastKey ne "" && $key ne $lastKey){
                if($count==2){
                        print "$lastKey\t$product\n";
                }
                $product=$value;
                $lastKey=$key;
                $count=1;
        }
        else{
                $product=$product*$value;
                $lastKey=$key;
                $count++;
        }
}
# the last key
if($count==2){
        print "$lastKey\t$product\n";
}
</code>

And submit the job (after putting the data file into HDFS as /tmp/v_data.txt, as before)

<code>
hadoop jar \
/usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u6.jar \
-input /tmp/v_data.txt -output /tmp/v.out \
-file ~/v_mapper.pl -mapper ~/v_mapper.pl \
-file ~/v_reducer.pl -reducer ~/v_reducer.pl
</code>

And that works.
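
To spot-check the output (the path follows from the job above):

<code>
hadoop fs -cat /tmp/v.out/part-00000 | head
# each line should read: key <tab> x*y
</code>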

==== Perl Hadoop::Streaming ====

  * All nodes

  * [[http://search.cpan.org/~spazm/Hadoop-Streaming-0.122420/lib/Hadoop/Streaming.pm]]

<code>

yum install cpan
cpan> install Hadoop::Streaming


Installing /usr/local/share/perl5/Hadoop/Streaming.pm
Installing /usr/local/share/perl5/Hadoop/Streaming/Mapper.pm
Installing /usr/local/share/perl5/Hadoop/Streaming/Reducer.pm
Installing /usr/local/share/perl5/Hadoop/Streaming/Combiner.pm
Installing /usr/local/share/perl5/Hadoop/Streaming/Role/Iterator.pm
Installing /usr/local/share/perl5/Hadoop/Streaming/Role/Emitter.pm
Installing /usr/local/share/perl5/Hadoop/Streaming/Reducer/Input.pm
Installing /usr/local/share/perl5/Hadoop/Streaming/Reducer/Input/Iterator.pm
Installing /usr/local/share/perl5/Hadoop/Streaming/Reducer/Input/ValuesIterator.pm

</code>

  * How to use this? A word-count sketch adapted from the module's documentation follows.
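
This is a minimal mapper/reducer pair in the Hadoop::Streaming style, following the module's POD; treat it as an untested sketch (the package and file names are mine):

<code>
#!/usr/bin/perl
# hs_wc_mapper.pl -- emit each word with a count of 1
package WordCount::Mapper;
use Moose;
with 'Hadoop::Streaming::Mapper';

sub map {
    my ($self, $line) = @_;
    $self->emit($_ => 1) for split /\s+/, $line;
}

package main;
WordCount::Mapper->run;
</code>

<code>
#!/usr/bin/perl
# hs_wc_reducer.pl -- sum the counts for each word
package WordCount::Reducer;
use Moose;
with 'Hadoop::Streaming::Reducer';

sub reduce {
    my ($self, $key, $values) = @_;
    my $count = 0;
    while ($values->has_next) {
        $count += $values->next;
    }
    $self->emit($key => $count);
}

package main;
WordCount::Reducer->run;
</code>

These scripts drop into the same ''hadoop jar ... -mapper/-reducer'' invocation used in the native streaming sections above.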
  
==== MySQL ====