MapReduce

Converting some fields in Mongo from String to Array

Submitted by 大城市里の小女人 on 2019-12-10 14:31:11
Question: I have a collection of documents where a "tags" field was switched over from being a space-separated list of tags to an array of individual tags. I want to update the previous space-separated fields so they are all arrays, like the new incoming data. I'm also having problems with the $type selector, because it applies the type check to the individual array elements, which are strings, so filtering by type just returns everything. How can I get every document that looks like the first example into …
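The conversion step itself is a simple split; a minimal sketch in Python for illustration (the collection and field names come from the question, but the exact driver calls in the comment are an assumption, pymongo-style):

```python
def normalize_tags(doc):
    """Convert a legacy space-separated 'tags' string into a list of tags.

    Documents already using the array form are returned unchanged, which
    makes the migration safe to re-run.
    """
    tags = doc.get("tags")
    if isinstance(tags, str):
        doc["tags"] = tags.split()
    return doc

# Sketch of the server-side loop (hypothetical collection name `coll`).
# Note: a plain {"tags": {"$type": "string"}} filter matches arrays of
# strings too -- the problem the question describes.  On MongoDB 3.6+,
# $expr with the $type aggregation operator checks the field itself:
#
# for doc in coll.find({"$expr": {"$eq": [{"$type": "$tags"}, "string"]}}):
#     coll.update_one({"_id": doc["_id"]},
#                     {"$set": {"tags": doc["tags"].split(" ")}})
```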

Reducers stopped working at 66.68% while running HIVE Join query

Submitted by 丶灬走出姿态 on 2019-12-10 13:52:48
Question: I am trying to join 6 tables, each of which has approximately 5 million rows. The join is on an account number that is sorted in ascending order in all tables. The map tasks finish successfully, but the reducers stop at 66.68%. I tried options like increasing the number of reducers, and also tried set hive.auto.convert.join = true; set hive.hashtable.max.memory.usage = 0.9; and set hive.smalltable.filesize = 25000000L; but the result is the same. Tried with a small number …
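Reducers that stall at a fixed percentage during a join are often a symptom of key skew: a few very hot join keys send most of the rows to one reducer. A quick way to check is to count rows per join key; a hedged sketch of that check in Python (in Hive itself the equivalent would be a GROUP BY count on the account-number column):

```python
from collections import Counter

def top_skewed_keys(rows, key_index=0, top_n=3):
    """Count how often each join key occurs across the input rows.

    A handful of keys with counts far above the rest is the classic
    cause of one reducer receiving most of the data and stalling.
    """
    counts = Counter(row[key_index] for row in rows)
    return counts.most_common(top_n)
```

Usage: `top_skewed_keys([("acct1", ...), ...])` returns the hottest keys with their counts, which tells you whether skew handling (e.g. Hive's skew-join settings) is worth trying before tuning reducer counts further.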

Get index of given element in array field in MongoDB

Submitted by 北城余情 on 2019-12-10 13:22:13
Question: Consider this MongoDB document: {_id:123, "food":[ "apple", "banana", "mango" ]} How do I get the position of "mango" in "food"? The query should return 2 for the example above, not the whole document. Please show a working query. Answer 1: Starting from MongoDB version 3.4 we can use the $indexOfArray operator to return the index at which a given element can be found in an array. $indexOfArray takes three arguments. The first is the name of the array field, prefixed with a $ sign. The …
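The semantics of the lookup are easy to pin down in plain Python: first index of the value, or -1 when it is absent, which matches how $indexOfArray reports a missing element. A small sketch (the aggregation pipeline in the comment mirrors the question's field names):

```python
def index_of_array(arr, value):
    """Mimic $indexOfArray's core behavior: first index of value in arr,
    or -1 when the value is not present."""
    try:
        return arr.index(value)
    except ValueError:
        return -1

# The MongoDB 3.4+ aggregation for the question's document would be shaped like:
# db.coll.aggregate([
#     {"$project": {"index": {"$indexOfArray": ["$food", "mango"]}}}
# ])
```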

Is it possible to disable sorting in hadoop?

Submitted by 戏子无情 on 2019-12-10 13:07:58
Question: My job doesn't require sorting, just aggregating information per key, so I wonder whether it is possible to disable sorting in order to increase performance. Note: I can't set the reducer count to zero, because I need to aggregate data from many mappers; I am just not interested in a sorted result within one reducer. Answer 1: One of the main purposes of sorting the map output is that, when the tuples reach the reducer, it has to group values by key in order to invoke the reduce task; with the sorted map output list it …
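The answer's point, that the sorted map output is what lets the reducer group all values for a key in one contiguous pass, can be sketched with itertools.groupby, which has exactly the same contract: it only groups correctly when its input is sorted by key.

```python
from itertools import groupby

def reduce_side_group(map_output):
    """Group (key, value) pairs per key, the way a reducer sees them.

    The sort stands in for the shuffle's sort phase.  Without it,
    groupby (like a Hadoop reducer) would see the same key surface in
    several separate fragments instead of one group.
    """
    pairs = sorted(map_output)
    return {k: [v for _, v in grp]
            for k, grp in groupby(pairs, key=lambda kv: kv[0])}
```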

hadoop-streaming: reducer in pending state, doesn't start?

Submitted by 半城伤御伤魂 on 2019-12-10 12:08:40
Question: I have a map-reduce job which was running fine until I started to see failed map tasks like: attempt_201110302152_0003_m_000010_0 task_201110302152_0003_m_000010 worker1 FAILED Task attempt_201110302152_0003_m_000010_0 failed to report status for 602 seconds. Killing! ------- Task attempt_201110302152_0003_m_000010_0 failed to report status for 607 seconds. Killing! attempt_201110302152_0003_m_000010_1 task_201110302152_0003_m_000010 master FAILED java.lang…

How can I tell how many mappers and reducers are running?

Submitted by 拥有回忆 on 2019-12-10 11:52:54
Question: I have a task that is designed to run dozens of map/reduce jobs. Some of them are IO-intensive, some are mapper-intensive, some are reducer-intensive. I would like to be able to monitor the number of mappers and reducers currently in use, so that when a set of mappers is freed up I can push another mapper-intensive job to the cluster. I don't want to just stack them up on the queue, because they might clog up the mappers and not let the reducer-intensive ones run. Is there a command line …

MapReduce Java program to calculate max temperature not starting to run; it is run on a local desktop, importing external JAR files

Submitted by 末鹿安然 on 2019-12-10 11:43:53
Question: 1> This is my main method: package dataAnalysis; import java.io.IOException; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapred.FileOutputFormat; import org.apache.hadoop.mapred.JobConf; import org.apache.hadoop.mapred.TextOutputFormat; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; public class Weather { …
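Note that the excerpt's import list mixes the old API (org.apache.hadoop.mapred.*) with the new one (org.apache.hadoop.mapreduce.*), a frequent cause of jobs that fail to start. The max-temperature computation itself, independent of the Hadoop wiring, is just a per-key maximum; a sketch in Python with a hypothetical (year, temperature) record format:

```python
def max_temperature(records):
    """records: iterable of (year, temperature) pairs, as a mapper might emit.

    Returns the maximum temperature observed per year, which is exactly
    what the reducer side of the classic weather example computes.
    """
    best = {}
    for year, temp in records:
        if year not in best or temp > best[year]:
            best[year] = temp
    return best
```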

How to process/extract .pst using hadoop Map reduce

Submitted by 痴心易碎 on 2019-12-10 11:30:25
Question: I am using MAPI tools (a Microsoft library, in .NET) and then the Apache Tika libraries to process and extract PST files from the Exchange server, which is not scalable. How can I process/extract PST files the MapReduce way? Is there any tool or library available in Java which I can use in my MR jobs? Any help would be appreciated. The Jpst lib internally uses: PstFile pstFile = new PstFile(java.io.File) and the problem is that for the Hadoop APIs we don't have anything close to java.io.File. The following option is always …

How to chain mapper/reducer in Hadoop 1.0.4?

Submitted by 亡梦爱人 on 2019-12-10 11:29:02
Question: I was using the "new" API of Hadoop 1.0.4 (classes in the package org.apache.hadoop.mapreduce). When I wanted to chain mappers/reducers, I found out that ChainMapper and ChainReducer are written for the "old" API (classes in the package org.apache.hadoop.mapred). What should I do? Answer 1: I was also searching for the same thing. I did get the answer, and even though it's late I thought sharing it might help someone. From Hadoop 2.0 onwards you can find ChainMapper and ChainReducer in the package org.apache.hadoop…
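What ChainMapper buys you, several map functions applied in sequence within a single task so that intermediate records never hit disk, can be sketched as plain function composition over record lists:

```python
def chain_mappers(mappers, records):
    """Apply a list of map functions in sequence, ChainMapper-style.

    Each mapper takes one record and returns a list of output records
    (so a mapper may emit zero, one, or many records); each mapper's
    output feeds the next mapper's input.
    """
    for mapper in mappers:
        records = [out for rec in records for out in mapper(rec)]
    return records
```

Usage: `chain_mappers([tokenize, normalize], lines)` runs both map steps in one pass, which is the pipelining the Chain classes provide inside a single Hadoop task.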

Filter a static RavenDB map/reduce index

Submitted by 可紊 on 2019-12-10 11:19:14
Question: Scenario/context: Raven 2.0 on RavenHQ; web app, so async is preferred. My application is a survey application. Each Survey has an array of Questions; conversely, each Submission (an individual's response to a survey) has an array of Answers. I have a static index that aggregates all answers so that I can display a chart based on the responses (e.g. for each question on each survey, how many people selected each option). These data are used to render, for example, a pie chart. This …
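The aggregation the static index performs, counting per survey and question how many submissions picked each option, is a map/reduce over answers: map each answer to a count of 1 keyed by (survey, question, option), then reduce by summing. A sketch in Python (the field names are hypothetical, not taken from the question's index):

```python
from collections import Counter

def answer_counts(submissions):
    """Count option selections per (survey, question, option).

    Map step: each answer contributes 1 under its composite key.
    Reduce step: Counter sums the contributions, yielding the
    per-option totals a pie chart would display.
    """
    counts = Counter()
    for sub in submissions:
        for ans in sub["answers"]:
            counts[(sub["survey_id"], ans["question_id"], ans["option"])] += 1
    return counts
```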