MapReduce

How to rearrange Hadoop wordcount output and sort it by value

自闭症网瘾萝莉.ら submitted on 2019-12-11 10:56:09

Question: I use the code below to get output like (Key, Value):

Apple 12
Bee 345
Cat 123

What I want is the output sorted by value in descending order, with the value placed before the key:

345 Bee
123 Cat
12 Apple

I found there is something called "secondary sort"; not going to lie, I'm quite lost. I tried changing context.write(key, result); but failed miserably. I'm new to Hadoop and not sure how to start tackling this problem. Any recommendation would be appreciated. Which function …
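Outside of a full secondary-sort job, the transformation being asked for can be sketched in plain Python (the counts and the descending order come from the example above; this only illustrates the target output, not the Hadoop mechanics):

```python
# Word counts as produced by the wordcount job (key, value).
counts = {"Apple": 12, "Bee": 345, "Cat": 123}

# Sort by value, descending, and emit (value, key) pairs.
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
for word, count in ranked:
    print(count, word)
```

In an actual Hadoop job the usual route is a second MapReduce pass that swaps key and value and uses a descending comparator.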

Hadoop MapReduce Streaming output different from the output of running MapReduce locally

荒凉一梦 submitted on 2019-12-11 10:49:56

Question: I am running a simple MapReduce job written in Python, and I noticed that when I test the script locally I obtain a different output than when I run the job on Hadoop. My input is of this kind:

key1 val1
key1 val2
key1 val3
key1 val4
key2 val1
key2 val3
key2 val5
key3 val5
key4 val4

My mapper creates a dictionary of values with their corresponding list (string) of keys (e.g. val1 key1,key2; val2 key1; val3 key1,key2 …). Then for each value in the dictionary I print all the possible …
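A common source of local-vs-Hadoop differences with streaming jobs is that Hadoop sorts the mapper output by key before the reducer sees it, while a local `cat input | mapper | reducer` pipeline does not. A minimal Python sketch of that contract, using the sample input above (the (value, key) pair shape is an assumption based on the mapper description):

```python
# Mapper output for the sample input: (value, key) pairs.
mapped = [("val1", "key1"), ("val2", "key1"), ("val3", "key1"),
          ("val4", "key1"), ("val1", "key2"), ("val3", "key2"),
          ("val5", "key2"), ("val5", "key3"), ("val4", "key4")]

# Hadoop's shuffle sorts by key before the reducer runs; emulating it
# locally with an explicit sort reproduces the cluster's grouping order.
mapped.sort()

# Reducer: collect, for each value, the comma-separated list of keys.
groups = {}
for val, key in mapped:
    groups.setdefault(val, []).append(key)
result = {val: ",".join(keys) for val, keys in groups.items()}
```

When testing locally, inserting a `sort` between the mapper and reducer (`mapper | sort | reducer`) usually reconciles the two outputs.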

MongoDb : mapReduce out collection result

瘦欲@ submitted on 2019-12-11 10:46:54

Question: I have the Logincount.js below. Please tell me how I can also include a date field while creating the LoginCount collection. Right now the js file creates a collection with two fields, _id and value; I want it to also create a field called date holding yesterday's date. This is my js file:

m = function() { emit(this.cust_id, 1); }
r = function (k, vals) { var sum = 0; for (var i in vals) { sum += vals[i]; } return sum; }
q = function() { var currentDate = new Date(); currentDate …
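The "yesterday" value itself is ordinary date arithmetic; a hedged Python sketch of the idea (the date field name and attaching it to each output document are assumptions, since Logincount.js is truncated above — in the JS file the equivalent would be a new Date() shifted back one day):

```python
from datetime import date, timedelta

# Compute yesterday's date; the same idea applies when post-processing
# the map-reduce output collection.
yesterday = date.today() - timedelta(days=1)

# Example map-reduce result rows (_id, value) with the extra date field added.
results = [{"_id": "cust1", "value": 3}, {"_id": "cust2", "value": 7}]
for doc in results:
    doc["date"] = yesterday.isoformat()
```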

How to emit 2D double array from mapper using TwoDArrayWritable

回眸只為那壹抹淺笑 submitted on 2019-12-11 10:44:23

Question: I want to emit a 2D double array from the mapper using TwoDArrayWritable as the value. How do I write the context.write(key, …)? EDIT: And in the reducer, how do I get them back as a two-dimensional double array and print the values? I wrote this in the mapper:

row = E.length;
col = E[0].length;
TwoDArrayWritable array = new TwoDArrayWritable(DoubleWritable.class);
DoubleWritable[][] myInnerArray = new DoubleWritable[row][col];
// set values in myInnerArray
for (int k1 = 0; k1 < row; k1++) {
    for (int j1 = 0; j1 < col; j1++) {
        myInnerArray[k1] …
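Independent of the Writable API, the shape of the problem is serializing a 2-D array of doubles together with its dimensions so the other side can rebuild it. A Python sketch of that round trip (TwoDArrayWritable stores the row/column structure for you via set()/get(); this only illustrates the pattern, with an invented example array):

```python
# Flatten a 2-D array of doubles together with its dimensions ("mapper" side).
E = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
row, col = len(E), len(E[0])
payload = (row, col, [x for r in E for x in r])

# Rebuild the 2-D array from the flat payload ("reducer" side).
r, c, flat = payload
rebuilt = [flat[i * c:(i + 1) * c] for i in range(r)]
```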

Hadoop M/R secondary sort not working, based on the last name of the user

本小妞迷上赌 submitted on 2019-12-11 10:42:50

Question: I want to sort the output based on the last name of the user; the key being used is firstName. The following are the classes I am using, but I am not getting output sorted by lastName. I am new to Hadoop; I wrote this with help from various internet sources. Main class:

public class WordCount {
    public static class Map extends Mapper<LongWritable, Text, CustomKey, Text> {
        public static final Log log = LogFactory.getLog(Map.class);
        private final static IntWritable one = new IntWritable(1); …
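The idea behind a secondary sort is a composite key: keep grouping by the primary field while comparing on the secondary one as well. Stripped of the Hadoop plumbing (custom WritableComparable, partitioner, grouping comparator), the comparison itself can be sketched in Python (the names are invented examples):

```python
# Records keyed by firstName, to be ordered by lastName in the output.
records = [("John", "Smith"), ("Ann", "Brown"), ("John", "Adams")]

# Composite-key ordering: compare (lastName, firstName), mirroring what a
# custom key class's compareTo() would do in the Hadoop job.
ordered = sorted(records, key=lambda r: (r[1], r[0]))
```

In the job itself this comparison belongs in the CustomKey's compareTo(); sorting only on firstName there is the usual reason the lastName order never appears.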

PySpark - Combining Session Data without Explicit Session Key / Iterating over All Rows

最后都变了- submitted on 2019-12-11 10:34:08

Question: I am trying to aggregate session data without a true session "key" in PySpark. I have data where an individual is detected in an area at a specific time, and I want to aggregate that into a duration spent in each area during a specific visit (see below). The tricky part is that I want to infer the time someone exits each area as the time they are detected in the next area. This means I need to use the start time of the next area ID as the end time for any given area ID. Area …
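The inference step described above — take the next detection's start time as the current area's end time — is exactly what a lead() window over a per-person partition does in PySpark. The same logic in plain Python over rows sorted by person and time (the column layout is an assumption, since the sample data is truncated):

```python
# Detections: (person, area, start_time), sorted by person then start_time.
rows = [("p1", "A", 0), ("p1", "B", 10), ("p1", "C", 25), ("p2", "A", 5)]

# End time of each detection = start time of the next detection for the
# same person (None for the last one, where no exit is observed).
durations = []
for i, (person, area, start) in enumerate(rows):
    nxt = rows[i + 1] if i + 1 < len(rows) else None
    end = nxt[2] if nxt and nxt[0] == person else None
    durations.append((person, area, None if end is None else end - start))
```

In PySpark the equivalent is `F.lead("start_time").over(Window.partitionBy("person").orderBy("start_time"))`, which avoids iterating over rows entirely.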

accessing mongodb's object from mapper (MapReduce)

不羁岁月 submitted on 2019-12-11 10:29:31

Question: I have a follow-up question to one I asked before: "calculate frequency using mongodb aggregate framework". My data in MongoDB now looks like this:

{ "data": { "interaction": { "created_at": "Wed, 09 Apr 2014 14:38:16 +0000" } }, "_id": { "$oid": "53455b59edcd5e4e3fdd4ebb" } }

I used to have it like this:

[ { created_at: "2014-03-31T22:30:48.000Z", id: 450762158586880000, _id: "5339ec9808eb125965f2eae1" } ]

So to access created_at I was using a mapper like:

var mapper = function () { …
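The only change between the two shapes is how deep created_at sits: it moved from the top level to data.interaction.created_at. The path lookup can be sketched in Python (in a MongoDB map function the equivalent access would be this.data.interaction.created_at instead of this.created_at):

```python
# New document shape: created_at is nested under data.interaction.
doc = {
    "data": {"interaction": {"created_at": "Wed, 09 Apr 2014 14:38:16 +0000"}},
    "_id": {"$oid": "53455b59edcd5e4e3fdd4ebb"},
}

# Walk the nested path instead of reading the field from the top level.
created_at = doc["data"]["interaction"]["created_at"]
```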

Hadoop setting for maximum simultaneous map/reduce tasks does not work in pseudo-distributed mode

假如想象 submitted on 2019-12-11 10:25:16

Question: I configured Hadoop 2.4.1 on a single machine (4 cores) to use pseudo-distributed mode, and I am able to run my map/reduce program via the hadoop shell command on HDFS input files. But I notice that the map and reduce phases still appear to run in a single thread. So I tried hard-coding the properties mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum, both to 4 (just as a try; I know it is not an ideal setting). But I still see the map and reduce tasks …
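One likely cause: in Hadoop 2.x the mapreduce.tasktracker.* properties belong to the old MRv1 TaskTracker and are ignored under YARN; concurrency is instead bounded by how many containers fit into the NodeManager's resources. A hedched yarn-site.xml fragment along those lines (the 4096 MB figure is an assumed value for a small 4-core box, to be matched against mapreduce.map.memory.mb per task):

```xml
<!-- yarn-site.xml: resources the single NodeManager may hand out -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value>
</property>
```

Also note that the number of map tasks is capped by the number of input splits: a single small input file yields one split and therefore one mapper, regardless of these settings.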

MapReduce aggregation based on attributes contained outside of document

醉酒当歌 submitted on 2019-12-11 10:25:00

Question: Say I have a collection of 'activities', each of which has a name, cost, and location:

{_id : 1 , name: 'swimming', cost: '3.40', location: 'kirkstall'}
{_id : 2 , name: 'cinema', cost: '6.50', location: 'hyde park'}
{_id : 3 , name: 'gig', cost: '10.00', location: 'hyde park'}

I also have a people collection which records, for each activity, how many times they plan to do it in a year:

{_id : 1 , name: 'russell', activities : { {1 : 9} , {2 : 4} , {3 : 21} }}

I don't want to denormalise the …
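Since a mapReduce (or aggregation) runs over a single collection, a common approach without denormalising is to do the join in application code: load the activity costs into a map, then fold each person's planned counts over it. A Python sketch using the documents above (the yearly-spend goal is an assumption, since the question is truncated):

```python
# Activity costs keyed by _id (costs are stored as strings in the documents).
activities = {1: "3.40", 2: "6.50", 3: "10.00"}

# One person's planned counts per activity _id (from the 'russell' document).
plans = {1: 9, 2: 4, 3: 21}

# Join in application code: yearly spend = sum over activities of count * cost.
yearly_total = sum(n * float(activities[a]) for a, n in plans.items())
```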

Aggregate of different subtypes in document of a collection

这一生的挚爱 submitted on 2019-12-11 10:04:54

Question: Given an abstract document in collection md:

{ vals : [{ uid : string, val : string|array }] }

the following, partially correct aggregation is given:

db.md.aggregate(
    { $unwind : "$vals" },
    { $match : { "vals.uid" : { $in : ["x", "y"] } } },
    { $group : { _id : { uid : "$vals.uid" }, vals : { $addToSet : "$vals.val" } } }
);

That may lead to the following result:

"result" : [ { "_id" : { "uid" : "x" }, "vals" : [ [ "24ad52bc-c414-4349-8f3a-24fd5520428e", "e29dec2f-57d2-43dc-818a-1a6a9ec1cc64" ], …
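The nested arrays in the result come from val being string|array: $addToSet adds an array value as a single element rather than its members. A Python sketch of the flattening the pipeline would need (e.g. via a second $unwind on "$vals.val", or post-processing like this; the plain-string value is an invented example alongside the array from the result above):

```python
# Collected vals for one uid: a mix of plain strings and whole arrays,
# which is what $addToSet produces when val is string|array.
vals = [["24ad52bc-c414-4349-8f3a-24fd5520428e",
         "e29dec2f-57d2-43dc-818a-1a6a9ec1cc64"],
        "9bbef19e-example-string"]

# Flatten one level and de-duplicate, preserving set semantics.
flat = set()
for v in vals:
    flat.update(v if isinstance(v, list) else [v])
```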