MapReduce

Passing values from Mapper to Reducer

Submitted by 本小妞迷上赌 on 2019-12-24 04:28:08
Question: There is a small amount of metadata that I get by looking up the current file the mapper is working on (and a few other things). I need to send this metadata over to the reducer. Sure, I can have the mapper emit it in the <Key, Value> pair it generates, as <Key, Value + Metadata>, but I want to avoid that. Constraining myself a little further, I also do not want to use the DistributedCache. So, do I still have some options left? More precisely, my question is twofold: (1) I tried setting …
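One direction the cut-off part ("I tried setting …") points at is the job Configuration. Below is a minimal sketch, assuming the metadata is already known in the driver (or can be computed before submission): values set on the Configuration there are visible to every mapper and reducer via context.getConfiguration(). Values a mapper sets at runtime are not propagated back to reducers, so metadata discovered per input file at map time still has to travel through the emitted records or a side file. The property name meta.sourceTag is made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class MetadataViaConfiguration {

    // Driver side: stash the metadata in the job Configuration before submission.
    public static Job buildJob() throws Exception {
        Configuration conf = new Configuration();
        conf.set("meta.sourceTag", "batch-2019-12");   // hypothetical key and value
        return Job.getInstance(conf, "metadata-demo");
    }

    // Reducer side: read it back once in setup().
    public static class MyReducer extends Reducer<Text, Text, Text, Text> {
        private String sourceTag;

        @Override
        protected void setup(Context context) {
            sourceTag = context.getConfiguration().get("meta.sourceTag");
        }
    }
}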

Mongodb group by dbref field

Submitted by 霸气de小男生 on 2019-12-24 04:27:32
Question: I need to group products by model. Each product has a model field, which is a DBRef to the Models collection. I tried the aggregate query below, but I get the error "FieldPath field names may not start with '$'".

Aggregation query:

db.Products.aggregate([
  { $project: { _id: 0, model: 1, isActive: 1 } },
  { $group: { _id: "$model.$id", actives: { $push: "$isActive" } }}
]);

Example of a product document:

{ _id: ObjectId("54f48610e31701d2184dede5"), isActive: true, model: { $db: "database", $ref: "Models", $id: ObjectId("..
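The '$' restriction applies to aggregation field paths, so "$model.$id" cannot be referenced directly in $group. One workaround (a sketch, not necessarily the original accepted answer) is to fall back to mapReduce, where the DBRef subfields are ordinary JavaScript properties. The example below counts active products per model with the MongoDB Java driver; collecting the full array of isActive values would need a reduce that merges arrays instead. Database and collection names follow the question; everything else is illustrative.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class GroupByDbRef {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> products =
                    client.getDatabase("database").getCollection("Products");

            // In map/reduce the DBRef's $id is just a property, so the aggregation
            // pipeline's '$' field-path rule does not apply here.
            String map = "function() { emit(this.model.$id, this.isActive ? 1 : 0); }";
            String reduce = "function(key, values) { return Array.sum(values); }";

            for (Document doc : products.mapReduce(map, reduce)) {
                System.out.println(doc.toJson());   // { _id: <model id>, value: <active count> }
            }
        }
    }
}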

How to catch/detect exceptions in multi-threaded map/reduce using Reactor framework 2.x?

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-24 04:25:09
Question: I was playing with the code from this answer and it works smoothly. However, if an exception is thrown, the caller code does not catch it. How is an exception captured in Reactor 2.0 streams? What I want is this: if an exception is thrown, stream processing must stop, and I need to throw the exception up in the caller thread (the one that created the stream in the first place).

List<Map<String, Object>> data = readData();
Streams.from(data)
    .flatMap(m -> Streams.just(m)
        .dispatchOn(Environment
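The excerpt stops before the interesting part, so the following is only a generic, framework-agnostic sketch of the pattern being asked for: run the per-item work on worker threads, remember the first Throwable in an AtomicReference, and rethrow it on the caller thread once processing has finished. It deliberately does not guess at Reactor 2.x's own error operators; all names below are illustrative.

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

public class PropagateWorkerException {

    public static void main(String[] args) throws Exception {
        List<String> data = Arrays.asList("a", "b", "c");      // stand-in for readData()
        AtomicReference<Throwable> failure = new AtomicReference<>();
        CountDownLatch done = new CountDownLatch(data.size());
        ExecutorService pool = Executors.newFixedThreadPool(4);

        for (String item : data) {
            pool.submit(() -> {
                try {
                    process(item);                               // may throw
                } catch (Throwable t) {
                    failure.compareAndSet(null, t);              // keep only the first error
                } finally {
                    done.countDown();
                }
            });
        }

        done.await();                                            // back on the caller thread
        pool.shutdown();
        if (failure.get() != null) {
            throw new RuntimeException("stream processing failed", failure.get());
        }
    }

    private static void process(String item) {
        // per-item map/reduce work goes here
    }
}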

Use wget with Hadoop?

Submitted by 梦想的初衷 on 2019-12-24 04:19:11
Question: I have a dataset (~31 GB, a zipped file with the extension .gz) that lives at a web location, and I want to run my Hadoop program on it. The program is a slight modification of the original WordCount example that ships with Hadoop. In my case, Hadoop is installed on a remote machine (to which I connect via ssh and then run my jobs). The problem is that I can't transfer this large dataset to my home directory on the remote machine (due to a disk usage quota). So I tried searching for …
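A common way around the quota (a sketch, not necessarily the original answer) is to skip the home directory entirely and stream the file straight into HDFS, either from the shell with wget -qO- <url> | hdfs dfs -put - /data/dataset.gz, or programmatically as below. The URL, paths, and namenode address are placeholders. Note that a single .gz file is not splittable, so the job will read it in one map task.

import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class UrlToHdfs {
    public static void main(String[] args) throws Exception {
        String src = "http://example.com/dataset.gz";            // placeholder URL
        Path dst = new Path("/data/dataset.gz");                  // placeholder HDFS path

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Copy the HTTP stream directly into HDFS; nothing is staged on local disk.
        try (InputStream in = new URL(src).openStream();
             FSDataOutputStream out = fs.create(dst)) {
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }
}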

How to change tmp directory in yarn

Submitted by 一个人想着一个人 on 2019-12-24 03:53:14
Question: I have written an MR job and run it in local mode on Hadoop 1.x with the following configuration settings:

mapred.local.dir=<<local directory with a good amount of space>>
fs.default.name=file:///
mapred.job.tracker=local

Now I am on Hadoop 2.x and run the same job with the same configuration settings, but I get the error "Disk Out of Space". Is it that, if I switch from Hadoop 1.x to 2.x (using Hadoop 2.6 jars), the same configuration settings for changing the tmp dir no longer work?
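In Hadoop 2.x those 1.x property names are deprecated: fs.default.name becomes fs.defaultFS, mapred.job.tracker=local becomes mapreduce.framework.name=local, and the local scratch space is governed by hadoop.tmp.dir and mapreduce.cluster.local.dir (plus yarn.nodemanager.local-dirs on a real YARN cluster). Below is a minimal local-mode sketch; the directory paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class LocalModeConf {
    public static Job buildJob() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");                                // was fs.default.name
        conf.set("mapreduce.framework.name", "local");                       // was mapred.job.tracker=local
        conf.set("hadoop.tmp.dir", "/big/disk/hadoop-tmp");                  // placeholder path
        conf.set("mapreduce.cluster.local.dir", "/big/disk/mapred-local");   // was mapred.local.dir
        return Job.getInstance(conf, "my-mr-job");
    }
}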

Is there a workaround to allow using a regex in the MongoDB aggregation pipeline

Submitted by 纵饮孤独 on 2019-12-24 03:43:32
Question: I'm trying to create a pipeline that counts how many documents match some conditions, but I can't see any way to use a regular expression in those conditions. Here's a simplified version of my pipeline, with annotations:

db.Collection.aggregate([
  // Pipeline before the issue
  {'$group': {
    '_id': {
      'field': '$my_field',            // Included for completeness
    },
    'first_count': {'$sum': {          // We're going to count the number
      '$cond': [                       // of documents that have 'foo' in
        {'$eq': ['$field_foo', 'foo']}, 1,
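On older servers there is no regex operator that can be used inside $cond; the usual workarounds are a separate $match stage with $regex (and a second query for the complementary count), or, on MongoDB 4.2+, the $regexMatch expression, which can sit directly inside $cond. Below is a hedged sketch of the 4.2+ form using the MongoDB Java driver; field names follow the question and the regex itself is illustrative.

import java.util.Arrays;
import org.bson.Document;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

public class RegexCountPipeline {
    public static void main(String[] args) {
        MongoCollection<Document> coll = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("test").getCollection("Collection");

        // 1 if field_foo matches /foo/, otherwise 0 ($regexMatch requires MongoDB 4.2+).
        Document regexCond = new Document("$cond", Arrays.asList(
                new Document("$regexMatch",
                        new Document("input", "$field_foo").append("regex", "foo")),
                1, 0));

        Document group = new Document("$group",
                new Document("_id", new Document("field", "$my_field"))
                        .append("first_count", new Document("$sum", regexCond)));

        coll.aggregate(Arrays.asList(group))
            .forEach(doc -> System.out.println(doc.toJson()));
    }
}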

Get document's placement in collection based on sort order

Submitted by 不问归期 on 2019-12-24 03:29:24
Question: I'm new to MongoDB (+Mongoose). I have a collection of highscores with documents that look like this:

{id: 123, user: 'User14', score: 101}
{id: 231, user: 'User10', score: 400}
{id: 412, user: 'User90', score: 244}
{id: 111, user: 'User12', score: 310}
{id: 221, user: 'User88', score: 900}
{id: 521, user: 'User13', score: 103}
+ thousands more...

Right now I'm getting the top 5 players like so:

highscores
  .find()
  .sort({'score': -1})
  .limit(5)
  .exec(function(err, users) { ...code... });

which is …
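For a single player's placement under that same sort order, a common approach (a sketch, not necessarily the accepted answer) is to count how many documents have a strictly higher score and add one; the same count query works from Mongoose as well. Shown here with the MongoDB Java driver; the database and collection names are placeholders.

import org.bson.Document;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;

public class HighscoreRank {
    public static void main(String[] args) {
        MongoCollection<Document> highscores = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("game").getCollection("highscores");    // placeholder names

        int myScore = 244;                                            // e.g. User90's score
        long better = highscores.countDocuments(Filters.gt("score", myScore));
        System.out.println("placement: " + (better + 1));             // 1-based rank
    }
}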

couchbase reduce by date in node.js

Submitted by 旧城冷巷雨未停 on 2019-12-24 03:26:58
Question: I have to write some code that filters active users with Couchbase and node.js. I have some user documents, and I made a view called "bydate" with the following code:

function (doc, meta) {
  if (meta.type == 'json') {
    if (doc.type == 'user') {
      if (doc.lastUpdate) {
        emit(dateToArray(doc.lastUpdate), doc.name);
      }
    }
  }
}

I can filter by day, month or year using the "group_level" setting in the Couchbase console; however, I have been unable to filter it properly on node
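Since dateToArray emits keys shaped like [year, month, day, hour, minute, second], group_level 1/2/3 collapses the reduce output to per-year/per-month/per-day buckets (the view needs a reduce function, e.g. the built-in _count, for grouping to apply). The same option can be set from code, not only in the console. Below is a hedged sketch using the Couchbase Java SDK 2.x view API, since I am not certain of the exact Node SDK call names; the bucket and design-document names are assumptions.

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.view.ViewQuery;
import com.couchbase.client.java.view.ViewResult;
import com.couchbase.client.java.view.ViewRow;

public class ActiveUsersByDay {
    public static void main(String[] args) {
        Bucket bucket = CouchbaseCluster.create("localhost").openBucket("users");  // assumed bucket name

        // group_level = 3 groups the reduced values by [year, month, day].
        ViewResult result = bucket.query(
                ViewQuery.from("user_views", "bydate")    // assumed design document name
                         .group(true)
                         .groupLevel(3));

        for (ViewRow row : result) {
            System.out.println(row.key() + " -> " + row.value());
        }
    }
}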

Debugging hadoop in eclipse

Submitted by 白昼怎懂夜的黑 on 2019-12-24 03:24:21
Question: Is it possible to debug Hadoop's source code in Eclipse? I'm not asking about the map/reduce tasks. I want to see which part of the Hadoop source code is responsible for scheduling the map/reduce tasks and how it works. Is there any mechanism by which this can be done?

Answer 1: You can download the Hadoop project, integrate it into your Eclipse workspace, and use F5 or F6 to debug. Eclipse has different debugging modes: F5 for step-by-step debugging, F6 skips loops and subroutines, F7 skips the loop or …
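To step through Hadoop's own scheduling code running inside a daemon (for example the ResourceManager or the MRAppMaster) rather than code launched from Eclipse, a common complementary technique, offered here as a suggestion rather than part of the original answer, is JVM remote debugging: start the process with -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8000 (for instance via HADOOP_OPTS or YARN_RESOURCEMANAGER_OPTS) and attach Eclipse through a "Remote Java Application" debug configuration pointing at that host and port, with the Hadoop source attached to the jars.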

Job via Oozie HDP 2.1 not creating job.splitmetainfo

Submitted by ぃ、小莉子 on 2019-12-24 03:20:49
Question: When trying to execute a Sqoop job that has my Hadoop program passed as a jar file in the -jarFiles parameter, the execution fails with the error below. No resolution seems to be available, and other jobs run as the same Hadoop user execute successfully.

org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.FileNotFoundException: File does not exist: hdfs://sandbox.hortonworks.com:8020/user/root/.staging/job_1423050964699_0003/job.splitmetainfo
    at org.apache.hadoop.mapreduce.v2