MapReduce

Mapping through two data sets with Hadoop

戏子无情 submitted on 2019-12-23 15:58:16

Question: Suppose I have two key-value data sets, call them Data Sets A and B. I want to update all the data in Set A with data from Set B wherever the two match on keys. Because I'm dealing with such large quantities of data, I'm using Hadoop to MapReduce. My concern is that to do this key matching between A and B, I would need to load all of Set A (a lot of data) into the memory of every mapper instance. That seems rather inefficient. Would there be a recommended way to do this that doesn't require
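The usual answer here is a reduce-side join, which needs no in-memory copy of either set: each mapper tags its records with their source, and the shuffle delivers all records for a key to a single reduce call. A minimal sketch, assuming tab-separated "key<TAB>value" input lines; the class names and "A"/"B" tags are hypothetical:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ReduceSideJoin {
      // One mapper per input; each tags its records with the data set they came from.
      public static class SetAMapper extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] parts = line.toString().split("\t", 2);
          ctx.write(new Text(parts[0]), new Text("A\t" + parts[1]));
        }
      }

      public static class SetBMapper extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] parts = line.toString().split("\t", 2);
          ctx.write(new Text(parts[0]), new Text("B\t" + parts[1]));
        }
      }

      // The shuffle groups both sets' records by key; B's value wins when present.
      public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
          String fromA = null, fromB = null;
          for (Text v : values) {
            String[] parts = v.toString().split("\t", 2);
            if ("B".equals(parts[0])) fromB = parts[1]; else fromA = parts[1];
          }
          ctx.write(key, new Text(fromB != null ? fromB : fromA));
        }
      }
    }

The two mappers would be wired to their inputs with MultipleInputs.addInputPath; only one key's worth of records is ever held in memory at a time.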

Spark - Group by Key then Count by Value

别等时光非礼了梦想. submitted on 2019-12-23 15:43:54

Question: I have non-unique key-value pairs that I have created using the map function from an RDD Array[String]:

    val kvPairs = myRdd.map(line => (line(0), line(1)))

This produces data of the format:

    1, A
    1, A
    1, B
    2, C

I would like to group all of the keys by their values and provide the counts for these values, like so:

    1, {(A, 2), (B, 1)}
    2, {(C, 1)}

I have tried many different attempts, but the closest I can get is with something like this:

    kvPairs.sortByKey().countByValue()

This gives

    1, (A, 2)
    1, (B,
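One way to get this shape (a sketch, not the accepted answer; shown with Spark's Java API, and the variable names are hypothetical): count each whole (key, value) pair first with reduceByKey, then regroup the counts under the key.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class GroupThenCount {
      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("GroupThenCount").setMaster("local[*]"));
        JavaPairRDD<String, String> kvPairs = sc.parallelizePairs(Arrays.asList(
            new Tuple2<>("1", "A"), new Tuple2<>("1", "A"),
            new Tuple2<>("1", "B"), new Tuple2<>("2", "C")));

        JavaPairRDD<String, Iterable<Tuple2<String, Integer>>> grouped = kvPairs
            .mapToPair(p -> new Tuple2<>(p, 1))            // ((key, value), 1)
            .reduceByKey(Integer::sum)                     // ((key, value), count)
            .mapToPair(t -> new Tuple2<>(t._1._1, new Tuple2<>(t._1._2, t._2)))
            .groupByKey();                                 // key -> [(value, count), ...]

        grouped.collect().forEach(System.out::println);    // e.g. (1,[(A,2), (B,1)])
        sc.close();
      }
    }

Counting before grouping means only one record per distinct (key, value) pair crosses the final shuffle, rather than every duplicate.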

Why should a Writable datatype be Mutable?

大憨熊 submitted on 2019-12-23 12:36:26

Question: Why should a Writable datatype be mutable? What are the advantages of using Text (vs. String) as a datatype for Key/Value in the Map, Combine, Shuffle, or Reduce phases? Thanks & Regards, Raja

Answer 1: You can't choose; these datatypes must be mutable. The reason is the serialization mechanism. Let's look at the code:

    // version 1.x MapRunner#run()
    K1 key = input.createKey();
    V1 value = input.createValue();
    while (input.next(key, value)) {
      // map pair to output
      mapper.map(key, value, output, reporter);
    }
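The point of the snippet is that key and value are created once and then refilled in place by input.next() for every record, so there is no per-record allocation; that only works if the types are mutable. The well-known flip side (an illustrative sketch, not from the original answer) is that code keeping references across iterations must copy them:

    // Inside reduce(Text key, Iterable<Text> values, Context ctx):
    // the framework reuses one Text instance for every value.
    List<Text> kept = new ArrayList<>();
    for (Text v : values) {
      kept.add(new Text(v));  // copy -- kept.add(v) would store one reused object N times
    }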

How does mapreduce sort and shuffle work?

雨燕双飞 submitted on 2019-12-23 12:26:23

Question: I am using Yelp's MRJob library for achieving map-reduce functionality. I know that map-reduce has an internal sort and shuffle algorithm which sorts the values on the basis of their keys. So if I have the following results after the map phase:

    (1, 24)
    (4, 25)
    (3, 26)

I know the sort and shuffle phase will produce the following output:

    (1, 24)
    (3, 26)
    (4, 25)

which is as expected. But if I have two similar keys and different values, why does the sort and shuffle phase sort the data on the basis of first
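For context (this is Hadoop-side background, not MRJob-specific): the shuffle sorts by key only, so the relative order of values under one key is not something to rely on. When values must arrive in a defined order, the standard device is a secondary sort with a composite key. A sketch of the composite-key idea (hypothetical class, assuming int keys and values):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Sorts by the real key first, then by the value, so values for one key
    // reach the reducer in a defined order.
    public class IntPair implements WritableComparable<IntPair> {
      private int key, value;

      public void set(int key, int value) { this.key = key; this.value = value; }

      public void write(DataOutput out) throws IOException {
        out.writeInt(key);
        out.writeInt(value);
      }

      public void readFields(DataInput in) throws IOException {
        key = in.readInt();
        value = in.readInt();
      }

      public int compareTo(IntPair o) {
        int c = Integer.compare(key, o.key);
        return c != 0 ? c : Integer.compare(value, o.value);
      }
    }

A grouping comparator that compares only the key part then ensures all values for one real key still arrive in a single reduce call, already sorted.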

MongoDB Aggregate for a sum on a per week basis for all prior weeks

倖福魔咒の submitted on 2019-12-23 12:00:02

Question: I've got a series of docs in MongoDB. An example doc would be

    {
      createdAt: Mon Oct 12 2015 09:45:20 GMT-0700 (PDT),
      year: 2015,
      week: 41
    }

Imagine these span all weeks of the year, and there can be many in the same week. I want to aggregate them in such a way that each resulting value is the sum of that week and all prior weeks, counting the total docs. So if there were something like 10 in the first week of the year and 20 in the second, the result could be something like

    [{ week: 1, total:
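A simple way to get a running total (a sketch under assumptions, not the original answer; shown with the MongoDB sync Java driver, and the database/collection names are hypothetical) is to let the server count per week with $group and accumulate the cumulative sum client-side:

    import java.util.ArrayList;
    import java.util.List;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;
    import static com.mongodb.client.model.Accumulators.sum;
    import static com.mongodb.client.model.Aggregates.group;
    import static com.mongodb.client.model.Aggregates.sort;
    import static com.mongodb.client.model.Sorts.ascending;

    public class WeeklyRunningTotal {
      public static void main(String[] args) {
        MongoCollection<Document> docs = MongoClients.create()
            .getDatabase("test").getCollection("docs");

        // Server side: one document per week, holding that week's count.
        List<Document> weekly = docs.aggregate(List.of(
            group("$week", sum("count", 1)),
            sort(ascending("_id"))
        )).into(new ArrayList<>());

        // Client side: fold the per-week counts into a running total.
        int running = 0;
        for (Document d : weekly) {
          running += d.getInteger("count");
          System.out.println("week " + d.getInteger("_id") + " total " + running);
        }
      }
    }

On MongoDB 5.0+ the whole computation can stay server-side with $setWindowFields, but the split above works on any version.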

How to parse CustomWritable from text in Hadoop

限于喜欢 submitted on 2019-12-23 11:56:48

Question: Say I have timestamped values for specific users in text files, like

    #userid; unix-timestamp; value
    1; 2010-01-01 00:00:00; 10
    2; 2010-01-01 00:00:00; 20
    1; 2010-01-01 01:00:00; 11
    2; 2010-01-01 01:00:00; 21
    1; 2010-01-02 00:00:00; 12
    2; 2010-01-02 00:00:00; 22

I have a custom class "SessionSummary" implementing readFields and write of WritableComparable. Its purpose is to sum up all values per user for each calendar day. So the mapper maps the lines to each user, and the reducer summarizes all
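A minimal sketch of what such a class typically looks like (the fields here are assumptions; the asker's actual SessionSummary is not shown):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;

    public class SessionSummary implements WritableComparable<SessionSummary> {
      private final Text userId = new Text();
      private final Text day = new Text();   // calendar day, e.g. "2010-01-01"
      private long valueSum;                 // sum of the user's values for that day

      public void set(String userId, String day, long valueSum) {
        this.userId.set(userId);
        this.day.set(day);
        this.valueSum = valueSum;
      }

      public void write(DataOutput out) throws IOException {
        userId.write(out);
        day.write(out);
        out.writeLong(valueSum);
      }

      public void readFields(DataInput in) throws IOException {
        userId.readFields(in);
        day.readFields(in);
        valueSum = in.readLong();
      }

      public int compareTo(SessionSummary o) {
        int c = userId.compareTo(o.userId);
        return c != 0 ? c : day.compareTo(o.day);
      }
    }

readFields must restore every field that write emits, in the same order; and if the summaries feed a follow-up job, writing them with SequenceFileOutputFormat avoids having to re-parse a text rendering at all.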

MapReduce - How sort reduce output by value

别等时光非礼了梦想. submitted on 2019-12-23 11:52:58

Question: How can I sort the reducer output by value, in decreasing order? I'm developing an application that must return the most-listened-to songs, so songs must be ordered by their number of listens. My application works in this way:

    Input:          songname@userid@boolean
    Map output:     songname userid
    Reduce output:  songname number_of_listening

Any idea how to do this?

Answer 1: Per the docs, reducer output is not re-sorted. Either sort the input to the reducer (if that works for your application) by setting an
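The other common route, sketched below (an assumption, not the quoted answer's code), is a second, trivial job that swaps each (song, count) pair to (count, song) and lets the shuffle sort the counts, with a comparator reversing the order:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SortByCount {
      // Reads "songname<TAB>count" lines produced by the first job.
      public static class SwapMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] parts = line.toString().split("\t");
          ctx.write(new IntWritable(Integer.parseInt(parts[1])), new Text(parts[0]));
        }
      }

      // Flips the natural int order so the biggest counts come first.
      public static class DescendingIntComparator extends WritableComparator {
        public DescendingIntComparator() { super(IntWritable.class, true); }
        @Override
        @SuppressWarnings("rawtypes")
        public int compare(WritableComparable a, WritableComparable b) {
          return b.compareTo(a);
        }
      }
      // Registered with: job.setSortComparatorClass(DescendingIntComparator.class);
    }

With the default identity reducer and a single reduce task, the output is one globally sorted list.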

Getting the Tool Interface warning even though it is implemented

自古美人都是妖i submitted on 2019-12-23 09:32:11

Question: I have a very simple "Hello world" style map/reduce job.

    public class Tester extends Configured implements Tool {

      @Override
      public int run(String[] args) throws Exception {
        if (args.length != 2) {
          System.err.printf("Usage: %s [generic options] <input> <output>\n",
              getClass().getSimpleName());
          ToolRunner.printGenericCommandUsage(System.err);
          return -1;
        }

        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(getClass());
        getConf().set("mapreduce.job.queuename", "adhoc");
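One detail in the snippet stands out (an observation from the code shown, not a confirmed diagnosis): the job is built from a fresh new Configuration() rather than from the configuration ToolRunner injects via getConf(), so anything GenericOptionsParser sets never reaches the job, and Hadoop warns that the generic options went unused. The usual fix:

    // Build the Job from the Tool's own configuration so the generic
    // options parsed by ToolRunner actually apply to it.
    Job job = Job.getInstance(getConf());
    job.setJarByClass(getClass());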