MapReduce

MongoDB Aggregate Sum Each Key on a Subdocument

Submitted by 本秂侑毒 on 2019-12-17 20:24:48
Question: I have multiple documents with this schema, one document per product per day:

    {
      _id: {},
      app_id: 'DHJFK67JDSJjdasj909',
      date: '2014-08-07',
      event_count: 32423,
      event_count_per_type: { 0: 322, 10: 4234, 20: 653, 30: 7562 }
    }

I would like to get the sum of each event type for a particular date range; the output I am looking for has each event type summed across all the documents. The keys of event_count_per_type can be anything, so I need something that can loop through each of
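The question is cut off above, but the usual approach for summing arbitrary subdocument keys is to turn the subdocument into key/value pairs, unwind them, and group by key. A minimal sketch using the MongoDB Java driver; note that $objectToArray needs MongoDB 3.4.4 or later (servers of this question's era had to fall back on mapReduce instead), and the connection string, database, collection, and date bounds are all placeholders:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;
    import java.util.Arrays;

    public class SumEventTypes {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> coll =
                        client.getDatabase("analytics").getCollection("daily_stats");

                coll.aggregate(Arrays.asList(
                        // Keep only documents inside the date range of interest.
                        new Document("$match", new Document("date",
                                new Document("$gte", "2014-08-01").append("$lte", "2014-08-31"))),
                        // Convert the subdocument to an array of {k, v} pairs so
                        // arbitrary, unknown keys can be handled generically.
                        new Document("$project", new Document("kv",
                                new Document("$objectToArray", "$event_count_per_type"))),
                        // One document per (key, value) pair.
                        new Document("$unwind", "$kv"),
                        // Sum the values per key across all matched documents.
                        new Document("$group", new Document("_id", "$kv.k")
                                .append("total", new Document("$sum", "$kv.v")))
                )).forEach(doc -> System.out.println(doc.toJson()));
            }
        }
    }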

MultipleTextOutputFormat alternative in new API

Submitted by 你。 on 2019-12-17 19:40:01
Question: As it stands, MultipleTextOutputFormat has not been migrated to the new API. So if we need to choose the output directory and output filename on the fly, based on the key-value pair being written, what alternative do we have with the new mapreduce API? Answer 1: I'm using AWS EMR Hadoop 1.0.3, and it is possible to specify different directories and files based on k/v pairs. Use either of the following functions from the MultipleOutputs class: public void write(KEYOUT key, VALUEOUT value, String
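The quoted overload is write(KEYOUT key, VALUEOUT value, String baseOutputPath). A minimal sketch of a new-API reducer that routes records into per-key subdirectories (the class name and key/value types here are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class RouteByKeyReducer extends Reducer<Text, Text, Text, Text> {
        private MultipleOutputs<Text, Text> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                // baseOutputPath may contain '/' characters, so each key can be
                // routed to its own subdirectory under the job output directory.
                mos.write(key, value, key.toString() + "/part");
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            mos.close();                 // flush all side outputs
        }
    }

In the driver, pairing this with LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) avoids the empty default part-r-* files when everything is written through MultipleOutputs.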

MapReduce shuffle/sort method

Submitted by 三世轮回 on 2019-12-17 19:16:54
Question: Somewhat of an odd question, but does anyone know what kind of sort MapReduce uses in the sort portion of shuffle/sort? I would have guessed merge sort or insertion sort (in keeping with the whole MapReduce paradigm), but I'm not sure. Answer 1: It's quicksort; afterwards the sorted intermediate outputs get merged together. Quicksort checks the recursion depth and gives up when it gets too deep, in which case heapsort is used instead. Have a look at the Quicksort class: org.apache.hadoop.util.QuickSort You can change
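The truncated sentence presumably goes on to say that the sorter is pluggable. A sketch under that assumption: the map.sort.class property (default org.apache.hadoop.util.QuickSort) accepts any org.apache.hadoop.util.IndexedSorter implementation, e.g. the bundled HeapSort:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class CustomSortDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Swap the map-side in-memory sorter from QuickSort to HeapSort.
            conf.set("map.sort.class", "org.apache.hadoop.util.HeapSort");
            Job job = Job.getInstance(conf, "custom-sort-demo");
            // ... configure mapper, reducer, and paths as usual ...
        }
    }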

Launch a MapReduce job from Eclipse

Submitted by 我的未来我决定 on 2019-12-17 18:43:41
Question: I've written a mapreduce program in Java which I can submit to a remote cluster running in distributed mode. Currently, I submit the job in two steps: export the mapreduce job as a jar (e.g. myMRjob.jar), then submit it to the remote cluster with the shell command hadoop jar myMRjob.jar. I would like to submit the job directly from Eclipse when I run the program. How can I do this? I am currently using CDH3, and an abridged version of my conf is: conf.set("hbase
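One common recipe: point the client configuration at the remote cluster and name the job jar explicitly, since a program launched from an IDE has no jar on its classpath for Hadoop to ship. A sketch using CDH3-era property names; all hostnames and paths are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SubmitFromEclipse {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Address the remote cluster (placeholder hostnames).
            conf.set("fs.default.name", "hdfs://namenode.example.com:8020");
            conf.set("mapred.job.tracker", "jobtracker.example.com:8021");
            // No job jar exists when running from Eclipse, so point the
            // framework at the exported jar to be shipped to the cluster.
            conf.set("mapred.jar", "/path/to/myMRjob.jar");

            Job job = new Job(conf, "submitted-from-eclipse");
            // ... set mapper/reducer/input/output as usual, then:
            // System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }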

What are the connections and differences between Hadoop Writable and java.io.serialization?

Submitted by 被刻印的时光 ゝ on 2019-12-17 18:33:43
Question: By implementing the Writable interface, an object can be serialized in Hadoop. So what are the connections and differences between Hadoop Writable and java.io.serialization? Answer 1: Underlying storage differences: Java Serializable: Serializable does not assume the class of stored values is known, so it tags instances with their class, i.e. it writes metadata about the object, including the class name, field names and types, and its superclass. ObjectOutputStream and ObjectInputStream optimize this
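To make the contrast concrete, here is a minimal custom Writable (the PointWritable type is hypothetical). Unlike Java serialization, nothing but the raw field values is written, which is what makes Writable compact and fast, and also why readFields must mirror write exactly:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    public class PointWritable implements Writable {
        private int x;
        private int y;

        public PointWritable() {}                      // no-arg constructor required
        public PointWritable(int x, int y) { this.x = x; this.y = y; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(x);                           // raw field data only --
            out.writeInt(y);                           // no class name, no field names
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            x = in.readInt();                          // must read back in exactly
            y = in.readInt();                          // the order write() wrote
        }
    }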

How to calculate the running total using aggregate

Submitted by 跟風遠走 on 2019-12-17 18:25:02
Question: I'm developing a simple financial app for keeping track of income and expenses. For the sake of simplicity, let's suppose these are some of my documents:

    { "_id" : ObjectId("54adc0659413535e02fba115"), "description" : "test1", "amount" : 100, "dateEntry" : ISODate("2015-01-07T23:00:00Z") }
    { "_id" : ObjectId("54adc21a0d150c760270f99c"), "description" : "test2", "amount" : 50, "dateEntry" : ISODate("2015-01-06T23:00:00Z") }
    { "_id" : ObjectId("54b05da766341e4802b785c0"), "description" :
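On modern servers (MongoDB 5.0+) a running total is a one-stage pipeline with $setWindowFields; older servers needed client-side accumulation or $group workarounds. A sketch with the MongoDB Java driver; the connection string, database, and collection names are placeholders:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;
    import java.util.Arrays;

    public class RunningTotal {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> coll =
                        client.getDatabase("finance").getCollection("entries");

                coll.aggregate(Arrays.asList(
                        new Document("$setWindowFields", new Document()
                                // Order documents by entry date...
                                .append("sortBy", new Document("dateEntry", 1))
                                // ...and sum amount from the start up to each document.
                                .append("output", new Document("runningTotal",
                                        new Document("$sum", "$amount")
                                                .append("window", new Document("documents",
                                                        Arrays.asList("unbounded", "current"))))))
                )).forEach(doc -> System.out.println(doc.toJson()));
            }
        }
    }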

Hadoop: how to access (many) photo images to be processed by map/reduce?

Submitted by こ雲淡風輕ζ on 2019-12-17 17:46:22
Question: I have 10M+ photos saved on the local file system. Now I want to go through each of them and analyze the binary of the photo to see if it's a dog. I basically want to do the analysis in a clustered hadoop environment. The problem is: how should I design the input for the map method? Let's say that in the map method, new FaceDetection(photoInputStream).isDog() is all the underlying logic for the analysis. Specifically: should I upload all of the photos to HDFS? Assuming yes, how can I use them in
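The question is cut short, but the classic obstacle here is that HDFS and the NameNode handle millions of small files poorly. A common mitigation, sketched under the assumption that each photo is small, is to pack them into a SequenceFile of (filename, bytes) records, which a SequenceFileInputFormat job can then stream to mappers; all paths are placeholders:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackPhotos {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            Path out = new Path("hdfs:///photos/packed.seq");
            // One (filename -> raw bytes) record per photo: HDFS copes far
            // better with a few large files than with 10M tiny ones.
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(out),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                for (String photo : args) {            // local photo paths
                    byte[] bytes = Files.readAllBytes(Paths.get(photo));
                    writer.append(new Text(photo), new BytesWritable(bytes));
                }
            }
            // Each mapper of a SequenceFileInputFormat<Text, BytesWritable> job
            // then receives the photo bytes, e.g. to feed the question's
            // new FaceDetection(new ByteArrayInputStream(value.copyBytes())).isDog()
        }
    }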

What are the _SUCCESS and part-r-00000 files in Hadoop?

Submitted by 送分小仙女□ on 2019-12-17 17:43:10
Question: Although I use Hadoop frequently on my Ubuntu machine, I have never thought about the _SUCCESS and part-r-00000 files. The output always resides in a part-r-00000 file, but what is the use of the _SUCCESS file? Why does the output file have the name part-r-00000? Is there any significance or nomenclature to it, or is it just randomly defined? Answer 1: See http://www.cloudera.com/blog/2010/08/what%E2%80%99s-new-in-apache-hadoop-0-21/ On the successful completion of a job, the MapReduce runtime creates a _SUCCESS
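For what it's worth, the naming scheme is part-(m|r)-NNNNN: m or r marks whether a map or a reduce task produced the file, and NNNNN is the zero-padded partition number of that task. If the empty _SUCCESS marker gets in the way of downstream tooling, it can be suppressed; a minimal sketch using the standard committer property:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class NoSuccessMarker {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Stop FileOutputCommitter from writing the empty _SUCCESS file.
            conf.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
            Job job = Job.getInstance(conf, "no-success-marker");
            // ... configure and submit the job as usual ...
        }
    }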

hadoop java.net.URISyntaxException: Relative path in absolute URI: rsrc:hbase-common-0.98.1-hadoop2.jar

Submitted by ≡放荡痞女 on 2019-12-17 16:33:31
Question: I have a map reduce job that connects to HBase and I can't figure out where I am running into this error:

    Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.eclipse.jdt.internal

Simple explanation of MapReduce?

Submitted by 久未见 on 2019-12-17 15:21:37
Question: Related to my CouchDB question. Can anyone explain MapReduce in terms a numbnuts could understand? Answer 1: Going all the way down to the basics of Map and Reduce. Map is a function which "transforms" items in some kind of list into another kind of item and puts them back into the same kind of list. Suppose I have a list of numbers, [1,2,3], and I want to double every number; in this case, the function to "double every number" is the function x = x * 2. Without mappings, I could write a simple loop,
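The answer's doubling example translates directly into modern Java; a minimal illustration of both halves of the paradigm using streams:

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    public class MapReduceBasics {
        public static void main(String[] args) {
            List<Integer> numbers = Arrays.asList(1, 2, 3);

            // Map: transform every item, producing a list of the same shape.
            List<Integer> doubled = numbers.stream()
                    .map(x -> x * 2)               // the "double every number" function
                    .collect(Collectors.toList());
            System.out.println(doubled);           // [2, 4, 6]

            // Reduce: fold a whole list down to a single value.
            int sum = numbers.stream().reduce(0, Integer::sum);
            System.out.println(sum);               // 6
        }
    }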