MapReduce

Passing values from Mapper to Reducer

Submitted by 本小妞迷上赌 on 2019-12-24 04:28:08
Question: There is a small amount of metadata that I get by looking up the current file the mapper is working on (and a few other things). I need to send this metadata over to the reducer. Sure, I can have the mapper emit it in the <Key, Value> pair it generates, as <Key, Value + Metadata>, but I want to avoid that. Constraining myself a little further, I also do not want to use the DistributedCache. So, do I still have some options left? More precisely, my question is twofold: (1) I tried setting …
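One direction the cut-off part ("I tried setting …") points at is the job Configuration. Below is a minimal sketch, assuming the metadata is already known in the driver (or can be computed before submission): values set on the Configuration there are visible to every mapper and reducer via context.getConfiguration(). Values a mapper sets at runtime are not propagated back to reducers, so metadata discovered per input file at map time still has to travel through the emitted records or a side file. The property name meta.sourceTag is made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class MetadataViaConfiguration {

    // Driver side: stash the metadata in the job Configuration before submission.
    public static Job buildJob() throws Exception {
        Configuration conf = new Configuration();
        conf.set("meta.sourceTag", "batch-2019-12");   // hypothetical key and value
        return Job.getInstance(conf, "metadata-demo");
    }

    // Reducer side: read it back once in setup().
    public static class MyReducer extends Reducer<Text, Text, Text, Text> {
        private String sourceTag;

        @Override
        protected void setup(Context context) {
            sourceTag = context.getConfiguration().get("meta.sourceTag");
        }
    }
}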

Mongodb group by dbref field

Submitted by 霸气de小男生 on 2019-12-24 04:27:32
Question: I need to group products by model. Each product has a model field, which is a DBRef to the Models collection. I tried the aggregate query below, but I get the error "FieldPath field names may not start with '$'".

Aggregation query:

db.Products.aggregate([
  { $project: { _id: 0, model: 1, isActive: 1 } },
  { $group: { _id: "$model.$id", actives: { $push: "$isActive" } }}
]);

Example of a product document:

{ _id: ObjectId("54f48610e31701d2184dede5"), isActive: true, model: { $db: "database", $ref: "Models", $id: ObjectId("..
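The '$' restriction applies to aggregation field paths, so "$model.$id" cannot be referenced directly in $group. One workaround (a sketch, not necessarily the original accepted answer) is to fall back to mapReduce, where the DBRef subfields are ordinary JavaScript properties. The example below counts active products per model with the MongoDB Java driver; collecting the full array of isActive values would need a reduce that merges arrays instead. Database and collection names follow the question; everything else is illustrative.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class GroupByDbRef {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> products =
                    client.getDatabase("database").getCollection("Products");

            // In map/reduce the DBRef's $id is just a property, so the aggregation
            // pipeline's '$' field-path rule does not apply here.
            String map = "function() { emit(this.model.$id, this.isActive ? 1 : 0); }";
            String reduce = "function(key, values) { return Array.sum(values); }";

            for (Document doc : products.mapReduce(map, reduce)) {
                System.out.println(doc.toJson());   // { _id: <model id>, value: <active count> }
            }
        }
    }
}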

How to catch/detect exceptions in multi-threaded map/reduce using Reactor framework 2.x?

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-24 04:25:09
Question: I was playing with the code from this answer and it works smoothly. However, if an exception is thrown, the caller code does not catch it. How is an exception captured in Reactor 2.0 streams? What I want is this: if an exception is thrown, stream processing must stop, and I need to throw the exception up in the caller thread (the one that created the stream in the first place).

List<Map<String, Object>> data = readData();
Streams.from(data)
    .flatMap(m -> Streams.just(m)
        .dispatchOn(Environment
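The excerpt stops before the interesting part, so the following is only a generic, framework-agnostic sketch of the pattern being asked for: run the per-item work on worker threads, remember the first Throwable in an AtomicReference, and rethrow it on the caller thread once processing has finished. It deliberately does not guess at Reactor 2.x's own error operators; all names below are illustrative.

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

public class PropagateWorkerException {

    public static void main(String[] args) throws Exception {
        List<String> data = Arrays.asList("a", "b", "c");      // stand-in for readData()
        AtomicReference<Throwable> failure = new AtomicReference<>();
        CountDownLatch done = new CountDownLatch(data.size());
        ExecutorService pool = Executors.newFixedThreadPool(4);

        for (String item : data) {
            pool.submit(() -> {
                try {
                    process(item);                               // may throw
                } catch (Throwable t) {
                    failure.compareAndSet(null, t);              // keep only the first error
                } finally {
                    done.countDown();
                }
            });
        }

        done.await();                                            // back on the caller thread
        pool.shutdown();
        if (failure.get() != null) {
            throw new RuntimeException("stream processing failed", failure.get());
        }
    }

    private static void process(String item) {
        // per-item map/reduce work goes here
    }
}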

Use wget with Hadoop?

Submitted by 梦想的初衷 on 2019-12-24 04:19:11
Question: I have a dataset (~31 GB, a zipped file with the extension .gz) that lives at a web location, and I want to run my Hadoop program on it. The program is a slight modification of the original WordCount example that ships with Hadoop. In my case, Hadoop is installed on a remote machine (to which I connect via ssh and then run my jobs). The problem is that I can't transfer this large dataset to my home directory on the remote machine (due to a disk usage quota). So I tried searching for …
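A common way around the quota (a sketch, not necessarily the original answer) is to skip the home directory entirely and stream the file straight into HDFS, either from the shell with wget -qO- <url> | hdfs dfs -put - /data/dataset.gz, or programmatically as below. The URL, paths, and namenode address are placeholders. Note that a single .gz file is not splittable, so the job will read it in one map task.

import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class UrlToHdfs {
    public static void main(String[] args) throws Exception {
        String src = "http://example.com/dataset.gz";            // placeholder URL
        Path dst = new Path("/data/dataset.gz");                  // placeholder HDFS path

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Copy the HTTP stream directly into HDFS; nothing is staged on local disk.
        try (InputStream in = new URL(src).openStream();
             FSDataOutputStream out = fs.create(dst)) {
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }
}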

How to change tmp directory in yarn

Submitted by 一个人想着一个人 on 2019-12-24 03:53:14
Question: I have written an MR job and run it in local mode on Hadoop 1.x with the following configuration settings:

mapred.local.dir=<<local directory with a good amount of space>>
fs.default.name=file:///
mapred.job.tracker=local

Now I am on Hadoop 2.x and run the same job with the same configuration settings, but I get the error "Disk Out of Space". Is it that, if I switch from Hadoop 1.x to 2.x (using Hadoop 2.6 jars), the same configuration settings for changing the tmp dir no longer work?
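In Hadoop 2.x those 1.x property names are deprecated: fs.default.name becomes fs.defaultFS, mapred.job.tracker=local becomes mapreduce.framework.name=local, and the local scratch space is governed by hadoop.tmp.dir and mapreduce.cluster.local.dir (plus yarn.nodemanager.local-dirs on a real YARN cluster). Below is a minimal local-mode sketch; the directory paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class LocalModeConf {
    public static Job buildJob() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");                                // was fs.default.name
        conf.set("mapreduce.framework.name", "local");                       // was mapred.job.tracker=local
        conf.set("hadoop.tmp.dir", "/big/disk/hadoop-tmp");                  // placeholder path
        conf.set("mapreduce.cluster.local.dir", "/big/disk/mapred-local");   // was mapred.local.dir
        return Job.getInstance(conf, "my-mr-job");
    }
}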

Is there a workaround to allow using a regex in the MongoDB aggregation pipeline

Submitted by 纵饮孤独 on 2019-12-24 03:43:32
Question: I'm trying to create a pipeline that counts how many documents match some conditions, but I can't see any way to use a regular expression in those conditions. Here's a simplified version of my pipeline, with annotations:

db.Collection.aggregate([
  // Pipeline before the issue
  {'$group': {
    '_id': {
      'field': '$my_field',            // Included for completeness
    },
    'first_count': {'$sum': {          // We're going to count the number
      '$cond': [                       // of documents that have 'foo' in
        {'$eq': ['$field_foo', 'foo']}, 1,
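On older servers there is no regex operator that can be used inside $cond; the usual workarounds are a separate $match stage with $regex (and a second query for the complementary count), or, on MongoDB 4.2+, the $regexMatch expression, which can sit directly inside $cond. Below is a hedged sketch of the 4.2+ form using the MongoDB Java driver; field names follow the question and the regex itself is illustrative.

import java.util.Arrays;
import org.bson.Document;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

public class RegexCountPipeline {
    public static void main(String[] args) {
        MongoCollection<Document> coll = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("test").getCollection("Collection");

        // 1 if field_foo matches /foo/, otherwise 0 ($regexMatch requires MongoDB 4.2+).
        Document regexCond = new Document("$cond", Arrays.asList(
                new Document("$regexMatch",
                        new Document("input", "$field_foo").append("regex", "foo")),
                1, 0));

        Document group = new Document("$group",
                new Document("_id", new Document("field", "$my_field"))
                        .append("first_count", new Document("$sum", regexCond)));

        coll.aggregate(Arrays.asList(group))
            .forEach(doc -> System.out.println(doc.toJson()));
    }
}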

Get document's placement in collection based on sort order

Submitted by 不问归期 on 2019-12-24 03:29:24
Question: I'm new to MongoDB (+Mongoose). I have a collection of highscores with documents that look like this:

{id: 123, user: 'User14', score: 101}
{id: 231, user: 'User10', score: 400}
{id: 412, user: 'User90', score: 244}
{id: 111, user: 'User12', score: 310}
{id: 221, user: 'User88', score: 900}
{id: 521, user: 'User13', score: 103}
+ thousands more...

Right now I'm getting the top 5 players like so:

highscores
  .find()
  .sort({'score': -1})
  .limit(5)
  .exec(function(err, users) { ...code... });

which is …
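For a single player's placement under that same sort order, a common approach (a sketch, not necessarily the accepted answer) is to count how many documents have a strictly higher score and add one; the same count query works from Mongoose as well. Shown here with the MongoDB Java driver; the database and collection names are placeholders.

import org.bson.Document;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;

public class HighscoreRank {
    public static void main(String[] args) {
        MongoCollection<Document> highscores = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("game").getCollection("highscores");    // placeholder names

        int myScore = 244;                                            // e.g. User90's score
        long better = highscores.countDocuments(Filters.gt("score", myScore));
        System.out.println("placement: " + (better + 1));             // 1-based rank
    }
}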

couchbase reduce by date in node.js

Submitted by 旧城冷巷雨未停 on 2019-12-24 03:26:58
Question: I have to write some code that filters active users with Couchbase and node.js. I have some user documents, and I made a view called "bydate" with the following code:

function (doc, meta) {
  if (meta.type == 'json') {
    if (doc.type == 'user') {
      if (doc.lastUpdate) {
        emit(dateToArray(doc.lastUpdate), doc.name);
      }
    }
  }
}

I can filter by day, month or year using the "group_level" setting in the Couchbase console; however, I have been unable to filter it properly on node
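Since dateToArray emits keys shaped like [year, month, day, hour, minute, second], group_level 1/2/3 collapses the reduce output to per-year/per-month/per-day buckets (the view needs a reduce function, e.g. the built-in _count, for grouping to apply). The same option can be set from code, not only in the console. Below is a hedged sketch using the Couchbase Java SDK 2.x view API, since I am not certain of the exact Node SDK call names; the bucket and design-document names are assumptions.

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.view.ViewQuery;
import com.couchbase.client.java.view.ViewResult;
import com.couchbase.client.java.view.ViewRow;

public class ActiveUsersByDay {
    public static void main(String[] args) {
        Bucket bucket = CouchbaseCluster.create("localhost").openBucket("users");  // assumed bucket name

        // group_level = 3 groups the reduced values by [year, month, day].
        ViewResult result = bucket.query(
                ViewQuery.from("user_views", "bydate")    // assumed design document name
                         .group(true)
                         .groupLevel(3));

        for (ViewRow row : result) {
            System.out.println(row.key() + " -> " + row.value());
        }
    }
}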

Debugging hadoop in eclipse

Submitted by 白昼怎懂夜的黑 on 2019-12-24 03:24:21
Question: Is it possible to debug Hadoop's source code in Eclipse? I'm not asking about the map/reduce tasks. I want to see which part of the Hadoop source code is responsible for scheduling the map/reduce tasks and how it works. Is there any mechanism by which this can be done?

Answer 1: You can download the Hadoop project, integrate it into your Eclipse workspace, and use F5 or F6 to debug. Eclipse has different debugging modes: F5 for step-by-step debugging, F6 skips loops and subroutines, F7 skips the loop or …
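To step through Hadoop's own scheduling code running inside a daemon (for example the ResourceManager or the MRAppMaster) rather than code launched from Eclipse, a common complementary technique, offered here as a suggestion rather than part of the original answer, is JVM remote debugging: start the process with -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8000 (for instance via HADOOP_OPTS or YARN_RESOURCEMANAGER_OPTS) and attach Eclipse through a "Remote Java Application" debug configuration pointing at that host and port, with the Hadoop source attached to the jars.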

Job via Oozie HDP 2.1 not creating job.splitmetainfo

Submitted by ぃ、小莉子 on 2019-12-24 03:20:49
Question: When trying to execute a Sqoop job that has my Hadoop program passed as a jar file in the -jarFiles parameter, the execution fails with the error below. No resolution seems to be available, and other jobs run as the same Hadoop user execute successfully.

org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.FileNotFoundException: File does not exist: hdfs://sandbox.hortonworks.com:8020/user/root/.staging/job_1423050964699_0003/job.splitmetainfo
    at org.apache.hadoop.mapreduce.v2