MapReduce

Multiple inputs into a Mapper in Hadoop

廉价感情. Submitted on 2019-12-11 11:52:52
Question: I am trying to send two files to a Hadoop reducer. I tried DistributedCache, but anything I add with addCacheFile in main doesn't seem to come back from getLocalCacheFiles in the mapper. Right now I am using FileSystem to read the file, but since I am running locally I can simply pass the name of the file. How would I do this if I were running on a real Hadoop system? Is there any way to send values to the mapper other than the file it is reading? Answer 1: I also had a lot of
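
The excerpt is cut off above; for illustration only, here is a minimal sketch of the usual way to pass small side values to tasks: store them in the job Configuration in the driver and read them back in Mapper.setup(). The class ParamDemo, the property name my.param, and the commented-out Job.addCacheFile/context.getCacheFiles calls (the Hadoop 2 replacement for the old DistributedCache API) are illustrative assumptions, not code from the original post.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ParamDemo {

  static class ParamMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String sideValue;

    @Override
    protected void setup(Context context) {
      // Read back whatever the driver stored in the job Configuration.
      sideValue = context.getConfiguration().get("my.param", "default");
      // For whole side files on Hadoop 2, the DistributedCache replacement is:
      //   java.net.URI[] cached = context.getCacheFiles();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Tag every input line with the side value, just to show it arrived.
      context.write(new Text(sideValue), value);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("my.param", "some small value");     // shipped to every map/reduce task
    Job job = Job.getInstance(conf, "param demo");
    job.setJarByClass(ParamDemo.class);
    job.setMapperClass(ParamMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // Hadoop 2 equivalent of DistributedCache.addCacheFile:
    //   job.addCacheFile(new java.net.URI("hdfs:///path/to/side-file"));
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```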

Load MapReduce output data into HBase

女生的网名这么多〃 Submitted on 2019-12-11 11:27:50
Question: For the last few days I've been experimenting with Hadoop. I'm running Hadoop in pseudo-distributed mode on Ubuntu 12.10 and have successfully executed some standard MapReduce jobs. Next I wanted to start experimenting with HBase. I installed HBase and played a bit in the shell. That all went fine, so I wanted to experiment with HBase from a simple Java program: I wanted to take the output of one of the previous MapReduce jobs and load it into an HBase table. I've written a Mapper that should
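
The excerpt is cut off before the Mapper; as a hedged illustration of the general pattern, here is a sketch that feeds the text output of an earlier job into an HBase table through TableOutputFormat. The table name "mytable", the column family "cf", and the assumption that the input lines are tab-separated key/count pairs are all illustrative, and the sketch uses the older 0.9x-era Put.add call (newer clients use addColumn).

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class HBaseLoader {

  static class LoadMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");       // assumes "rowkey<TAB>count" lines
      if (parts.length < 2) {
        return;                                             // skip malformed lines
      }
      Put put = new Put(Bytes.toBytes(parts[0]));           // row key = first column
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"),  // cf:count = second column
              Bytes.toBytes(parts[1]));
      context.write(new ImmutableBytesWritable(put.getRow()), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "load MR output into HBase");
    job.setJarByClass(HBaseLoader.class);
    job.setMapperClass(LoadMapper.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // output dir of the earlier job
    // Wires up TableOutputFormat for the target table; with zero reducers the
    // Puts emitted by the mapper are written straight into HBase.
    TableMapReduceUtil.initTableReducerJob("mytable", null, job);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```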

CouchDB - startkey/endkey doesn't filter records with key arrays as expected

半腔热情 Submitted on 2019-12-11 11:25:59
Question: I have a CouchDB map-reduce view which outputs this: { rows: [ { key: [ "2014-08-20", 2, "registration" ], value: 2 }, { key: [ "2014-08-20", 2, "search" ], value: 3 }, { key: [ "2014-08-21", 2, "registration" ], value: 3 }, { key: [ "2014-08-21", 2, "search" ], value: 4 } ] } I need to query all the records between 2014-08-20 and 2014-08-21; at the same time I need the integer value in the middle to be 2 and the last value to be "registration". My curl request URL looks like this

HADOOP - Reduce Phase Hangs on Simple MR Job

流过昼夜 Submitted on 2019-12-11 11:22:22
Question: Here is a simple MapReduce job. Initially it is just a simple way of copying the files in an input directory to an output directory. The map phase completes, but the reduce phase just hangs. What am I doing wrong? It is a small amount of code; here is the whole job: import java.io.IOException; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.*; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org
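
The asker's code is truncated above. For comparison only, here is a minimal pass-through copy job (a sketch, not the code from the question) in which the reducer simply re-emits every value it receives; all class names and key/value types are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CopyJob {

  static class CopyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(key, value);                 // pass every line through unchanged
    }
  }

  static class CopyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void reduce(LongWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text value : values) {
        context.write(key, value);               // re-emit every value for this key
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "copy");
    job.setJarByClass(CopyJob.class);
    job.setMapperClass(CopyMapper.class);
    job.setReducerClass(CopyReducer.class);
    job.setOutputKeyClass(LongWritable.class);   // must match what mapper and reducer emit
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```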

Which runs first, Combiner or Partitioner in a MapReduce Job

不羁岁月 Submitted on 2019-12-11 11:16:08
Question: I am confused because I have found two answers to this. 1) Hadoop: The Definitive Guide, 3rd edition, Chapter 6, "The Map Side" says: "Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort." 2) The Yahoo developer tutorial (Yahoo tutorial) says the Combiner runs

HBase completebulkload returns exception

妖精的绣舞 Submitted on 2019-12-11 11:13:35
Question: I am trying to bulk-populate an HBase table quickly from a text file (several GB) using the bulk loading method described in the Hadoop docs. I have created an HFile which I now want to push to my HBase table. When I use this command: hadoop jar /home/hxcaine/hadoop/lib/hbase.jar completebulkload /user/hxcaine/dbpopulate/output/cf1 my_hbase_table the job starts and then I get this exception: Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/util/concurrent
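
One alternative worth sketching when the completebulkload CLI runs into classpath problems is to drive the same load from a small Java program, so the HBase jars (and their Guava dependency) come from the classpath you build against. This is a minimal sketch, assuming the older HTable/LoadIncrementalHFiles client API and assuming that /user/hxcaine/dbpopulate/output is the directory containing the cf1 column-family subdirectory; neither assumption comes from the original question.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Table name taken from the question; the path is assumed to be the
    // directory that contains one subdirectory per column family (cf1 here).
    HTable table = new HTable(conf, "my_hbase_table");
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    loader.doBulkLoad(new Path("/user/hxcaine/dbpopulate/output"), table);
    table.close();
  }
}
```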

Check status of running MongoDB map reduce jobs

三世轮回 Submitted on 2019-12-11 11:09:40
Question: How do I check the status of running map-reduce jobs in MongoDB? My code can run Mongo map-reduce jobs, but I'd like to have a status table listing jobs as "in progress" or "complete". How do I get that information from MongoDB? Answer 1: You can query for all running jobs using db.currentOp(). Usually a map/reduce job has a few attributes you can query for. An M/R job I just ran had the following stats: "opid" : 258101377, "active" : true, "secs_running" : 4638, "op" : "query", "ns" : "

Job hanging when example is run on Hadoop 0.23.0

自作多情 Submitted on 2019-12-11 11:09:30
Question: I am trying to add the capacity scheduler in Hadoop 0.23.0 and to run the sample pi and randomwriter programs. All the daemons are up and working fine, but the job hangs and no further output is displayed. I also cannot find where the logs are accumulated. Can anyone tell me why the job hangs and where the logs are stored? 2012-06-08 18:41:06,118 INFO mapred.YARNRunner (YARNRunner.java:createApplicationSubmissionContext(355)) -

# of failed Map Tasks exceeded allowed limit

冷暖自知 Submitted on 2019-12-11 11:08:13
Question: I am trying my hand at Hadoop streaming using Python. I have written simple map and reduce scripts with help from here. The map script is as follows: #!/usr/bin/env python import sys, urllib, re title_re = re.compile("<title>(.*?)</title>", re.MULTILINE | re.DOTALL | re.IGNORECASE) for line in sys.stdin: url = line.strip() match = title_re.search(urllib.urlopen(url).read()) if match : print url, "\t", match.group(1).strip() and the reduce script is as follows: #!/usr/bin/env python from

crossfilter dimension on 2 fields

喜你入骨 Submitted on 2019-12-11 10:58:37
Question: My data looks like this: field1,field2,value1,value2 a,b,1,1 b,a,2,2 c,a,3,5 b,c,6,7 d,a,6,7 I don't have a good way of rearranging that data, so let's assume it has to stay like this. I want to create a dimension on field1 and field2 combined: a single dimension that takes the union of all values in both field1 and field2 (in my example, the values should be [a, b, c, d]). As a reduce function you can assume reduceSum on value2, for example (allowing double counting for now). (have