MapReduce

Mapping through two data sets with Hadoop

戏子无情 submitted on 2019-12-23 15:58:16

Question: Suppose I have two key-value data sets, call them Data Sets A and B. I want to update all the data in Set A with data from Set B wherever the two match on keys. Because I'm dealing with such large quantities of data, I'm using Hadoop to MapReduce. My concern is that to do this key matching between A and B, I would need to load all of Set A (a lot of data) into the memory of every mapper instance. That seems rather inefficient. Would there be a recommended way to do this that doesn't require
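The usual answer here is a reduce-side join, which needs no in-memory copy of either set: each mapper tags its records with their source, and the shuffle delivers all records for a key to a single reduce call. A minimal sketch, assuming tab-separated "key<TAB>value" input lines; the class names and "A"/"B" tags are hypothetical:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ReduceSideJoin {
      // One mapper per input; each tags its records with the data set they came from.
      public static class SetAMapper extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] parts = line.toString().split("\t", 2);
          ctx.write(new Text(parts[0]), new Text("A\t" + parts[1]));
        }
      }

      public static class SetBMapper extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] parts = line.toString().split("\t", 2);
          ctx.write(new Text(parts[0]), new Text("B\t" + parts[1]));
        }
      }

      // The shuffle groups both sets' records by key; B's value wins when present.
      public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
          String fromA = null, fromB = null;
          for (Text v : values) {
            String[] parts = v.toString().split("\t", 2);
            if ("B".equals(parts[0])) fromB = parts[1]; else fromA = parts[1];
          }
          ctx.write(key, new Text(fromB != null ? fromB : fromA));
        }
      }
    }

The two mappers would be wired to their inputs with MultipleInputs.addInputPath; only one key's worth of records is ever held in memory at a time.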

Spark - Group by Key then Count by Value

别等时光非礼了梦想. submitted on 2019-12-23 15:43:54

Question: I have non-unique key-value pairs that I have created using the map function from an RDD Array[String]:

    val kvPairs = myRdd.map(line => (line(0), line(1)))

This produces data of the format:

    1, A
    1, A
    1, B
    2, C

I would like to group all of the keys by their values and provide the counts for these values, like so:

    1, {(A, 2), (B, 1)}
    2, {(C, 1)}

I have tried many different attempts, but the closest I can get is with something like this:

    kvPairs.sortByKey().countByValue()

This gives

    1, (A, 2)
    1, (B,
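One way to get this shape (a sketch, not the accepted answer; shown with Spark's Java API, and the variable names are hypothetical): count each whole (key, value) pair first with reduceByKey, then regroup the counts under the key.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class GroupThenCount {
      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("GroupThenCount").setMaster("local[*]"));
        JavaPairRDD<String, String> kvPairs = sc.parallelizePairs(Arrays.asList(
            new Tuple2<>("1", "A"), new Tuple2<>("1", "A"),
            new Tuple2<>("1", "B"), new Tuple2<>("2", "C")));

        JavaPairRDD<String, Iterable<Tuple2<String, Integer>>> grouped = kvPairs
            .mapToPair(p -> new Tuple2<>(p, 1))            // ((key, value), 1)
            .reduceByKey(Integer::sum)                     // ((key, value), count)
            .mapToPair(t -> new Tuple2<>(t._1._1, new Tuple2<>(t._1._2, t._2)))
            .groupByKey();                                 // key -> [(value, count), ...]

        grouped.collect().forEach(System.out::println);    // e.g. (1,[(A,2), (B,1)])
        sc.close();
      }
    }

Counting before grouping means only one record per distinct (key, value) pair crosses the final shuffle, rather than every duplicate.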

Why should a Writable datatype be Mutable?

大憨熊 submitted on 2019-12-23 12:36:26

Question: Why should a Writable datatype be mutable? What are the advantages of using Text (vs. String) as a datatype for Key/Value in the Map, Combine, Shuffle, or Reduce phases? Thanks & Regards, Raja

Answer 1: You can't choose; these datatypes must be mutable. The reason is the serialization mechanism. Let's look at the code:

    // version 1.x MapRunner#run()
    K1 key = input.createKey();
    V1 value = input.createValue();
    while (input.next(key, value)) {
      // map pair to output
      mapper.map(key, value, output, reporter);
    }
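The point of the snippet is that key and value are created once and then refilled in place by input.next() for every record, so there is no per-record allocation; that only works if the types are mutable. The well-known flip side (an illustrative sketch, not from the original answer) is that code keeping references across iterations must copy them:

    // Inside reduce(Text key, Iterable<Text> values, Context ctx):
    // the framework reuses one Text instance for every value.
    List<Text> kept = new ArrayList<>();
    for (Text v : values) {
      kept.add(new Text(v));  // copy -- kept.add(v) would store one reused object N times
    }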

How does mapreduce sort and shuffle work?

雨燕双飞 submitted on 2019-12-23 12:26:23

Question: I am using Yelp's MRJob library for achieving map-reduce functionality. I know that map-reduce has an internal sort and shuffle algorithm which sorts the values on the basis of their keys. So if I have the following results after the map phase:

    (1, 24)
    (4, 25)
    (3, 26)

I know the sort and shuffle phase will produce the following output:

    (1, 24)
    (3, 26)
    (4, 25)

which is as expected. But if I have two similar keys and different values, why does the sort and shuffle phase sort the data on the basis of first
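For context (this is Hadoop-side background, not MRJob-specific): the shuffle sorts by key only, so the relative order of values under one key is not something to rely on. When values must arrive in a defined order, the standard device is a secondary sort with a composite key. A sketch of the composite-key idea (hypothetical class, assuming int keys and values):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Sorts by the real key first, then by the value, so values for one key
    // reach the reducer in a defined order.
    public class IntPair implements WritableComparable<IntPair> {
      private int key, value;

      public void set(int key, int value) { this.key = key; this.value = value; }

      public void write(DataOutput out) throws IOException {
        out.writeInt(key);
        out.writeInt(value);
      }

      public void readFields(DataInput in) throws IOException {
        key = in.readInt();
        value = in.readInt();
      }

      public int compareTo(IntPair o) {
        int c = Integer.compare(key, o.key);
        return c != 0 ? c : Integer.compare(value, o.value);
      }
    }

A grouping comparator that compares only the key part then ensures all values for one real key still arrive in a single reduce call, already sorted.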

MongoDB Aggregate for a sum on a per week basis for all prior weeks

倖福魔咒の submitted on 2019-12-23 12:00:02

Question: I've got a series of docs in MongoDB. An example doc would be

    {
      createdAt: Mon Oct 12 2015 09:45:20 GMT-0700 (PDT),
      year: 2015,
      week: 41
    }

Imagine these span all weeks of the year, and there can be many in the same week. I want to aggregate them in such a way that each resulting value is the sum of that week and all prior weeks, counting the total docs. So if there were something like 10 in the first week of the year and 20 in the second, the result could be something like

    [{ week: 1, total:
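A simple way to get a running total (a sketch under assumptions, not the original answer; shown with the MongoDB sync Java driver, and the database/collection names are hypothetical) is to let the server count per week with $group and accumulate the cumulative sum client-side:

    import java.util.ArrayList;
    import java.util.List;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;
    import static com.mongodb.client.model.Accumulators.sum;
    import static com.mongodb.client.model.Aggregates.group;
    import static com.mongodb.client.model.Aggregates.sort;
    import static com.mongodb.client.model.Sorts.ascending;

    public class WeeklyRunningTotal {
      public static void main(String[] args) {
        MongoCollection<Document> docs = MongoClients.create()
            .getDatabase("test").getCollection("docs");

        // Server side: one document per week, holding that week's count.
        List<Document> weekly = docs.aggregate(List.of(
            group("$week", sum("count", 1)),
            sort(ascending("_id"))
        )).into(new ArrayList<>());

        // Client side: fold the per-week counts into a running total.
        int running = 0;
        for (Document d : weekly) {
          running += d.getInteger("count");
          System.out.println("week " + d.getInteger("_id") + " total " + running);
        }
      }
    }

On MongoDB 5.0+ the whole computation can stay server-side with $setWindowFields, but the split above works on any version.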

How to parse CustomWritable from text in Hadoop

限于喜欢 submitted on 2019-12-23 11:56:48

Question: Say I have timestamped values for specific users in text files, like

    #userid; unix-timestamp; value
    1; 2010-01-01 00:00:00; 10
    2; 2010-01-01 00:00:00; 20
    1; 2010-01-01 01:00:00; 11
    2; 2010-01-01 01:00:00; 21
    1; 2010-01-02 00:00:00; 12
    2; 2010-01-02 00:00:00; 22

I have a custom class "SessionSummary" implementing readFields and write of WritableComparable. Its purpose is to sum up all values per user for each calendar day. So the mapper maps the lines to each user, and the reducer summarizes all
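A minimal sketch of what such a class typically looks like (the fields here are assumptions; the asker's actual SessionSummary is not shown):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;

    public class SessionSummary implements WritableComparable<SessionSummary> {
      private final Text userId = new Text();
      private final Text day = new Text();   // calendar day, e.g. "2010-01-01"
      private long valueSum;                 // sum of the user's values for that day

      public void set(String userId, String day, long valueSum) {
        this.userId.set(userId);
        this.day.set(day);
        this.valueSum = valueSum;
      }

      public void write(DataOutput out) throws IOException {
        userId.write(out);
        day.write(out);
        out.writeLong(valueSum);
      }

      public void readFields(DataInput in) throws IOException {
        userId.readFields(in);
        day.readFields(in);
        valueSum = in.readLong();
      }

      public int compareTo(SessionSummary o) {
        int c = userId.compareTo(o.userId);
        return c != 0 ? c : day.compareTo(o.day);
      }
    }

readFields must restore every field that write emits, in the same order; and if the summaries feed a follow-up job, writing them with SequenceFileOutputFormat avoids having to re-parse a text rendering at all.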

MapReduce - How sort reduce output by value

别等时光非礼了梦想. submitted on 2019-12-23 11:52:58

Question: How can I sort the reducer output by value, in decreasing order? I'm developing an application that must return the most-listened-to songs, so songs must be ordered by their number of listens. My application works in this way:

    Input:          songname@userid@boolean
    Map output:     songname userid
    Reduce output:  songname number_of_listening

Any idea how to do this?

Answer 1: Per the docs, reducer output is not re-sorted. Either sort the input to the reducer (if that works for your application) by setting an
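The other common route, sketched below (an assumption, not the quoted answer's code), is a second, trivial job that swaps each (song, count) pair to (count, song) and lets the shuffle sort the counts, with a comparator reversing the order:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SortByCount {
      // Reads "songname<TAB>count" lines produced by the first job.
      public static class SwapMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] parts = line.toString().split("\t");
          ctx.write(new IntWritable(Integer.parseInt(parts[1])), new Text(parts[0]));
        }
      }

      // Flips the natural int order so the biggest counts come first.
      public static class DescendingIntComparator extends WritableComparator {
        public DescendingIntComparator() { super(IntWritable.class, true); }
        @Override
        @SuppressWarnings("rawtypes")
        public int compare(WritableComparable a, WritableComparable b) {
          return b.compareTo(a);
        }
      }
      // Registered with: job.setSortComparatorClass(DescendingIntComparator.class);
    }

With the default identity reducer and a single reduce task, the output is one globally sorted list.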

Getting the Tool Interface warning even though it is implemented

自古美人都是妖i submitted on 2019-12-23 09:32:11

Question: I have a very simple "Hello world" style map/reduce job.

    public class Tester extends Configured implements Tool {

      @Override
      public int run(String[] args) throws Exception {
        if (args.length != 2) {
          System.err.printf("Usage: %s [generic options] <input> <output>\n",
              getClass().getSimpleName());
          ToolRunner.printGenericCommandUsage(System.err);
          return -1;
        }

        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(getClass());
        getConf().set("mapreduce.job.queuename", "adhoc");
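One detail in the snippet stands out (an observation from the code shown, not a confirmed diagnosis): the job is built from a fresh new Configuration() rather than from the configuration ToolRunner injects via getConf(), so anything GenericOptionsParser sets never reaches the job, and Hadoop warns that the generic options went unused. The usual fix:

    // Build the Job from the Tool's own configuration so the generic
    // options parsed by ToolRunner actually apply to it.
    Job job = Job.getInstance(getConf());
    job.setJarByClass(getClass());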