MapReduce

Reuse Hadoop code in Spark efficiently?

放肆的年华 submitted on 2019-12-12 01:23:01

Question: Hi, I have code written for Hadoop and I am now trying to migrate it to Spark. The mappers and reducers are fairly complex, so I tried to reuse the Mapper and Reducer classes of the existing Hadoop code inside the Spark program. Can somebody tell me how to achieve this?

EDIT: So far I have been able to reuse the mapper class of the standard Hadoop word-count example in Spark, implemented as below.

wordcount.java:

    import scala.Tuple2;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.*;
    import org
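Since the snippet above is cut off, the sketch below shows one common way to reproduce the word-count Mapper and Reducer behaviour with the Java Spark API, rather than calling the Hadoop classes directly: the tokenizing and summing logic is expressed as flatMap/reduceByKey operators. This is a minimal sketch assuming the Spark 2.x Java API; the class name and input/output paths are placeholders.

    import java.util.Arrays;
    import scala.Tuple2;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class WordCountOnSpark {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("wordcount");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines = sc.textFile(args[0]);      // input path (placeholder)

                // Equivalent of the word-count TokenizerMapper: split each line and emit (word, 1)
                JavaPairRDD<String, Integer> ones = lines
                        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                        .mapToPair(word -> new Tuple2<>(word, 1));

                // Equivalent of the IntSumReducer: sum the counts per word
                JavaPairRDD<String, Integer> counts = ones.reduceByKey((a, b) -> a + b);

                counts.saveAsTextFile(args[1]);                     // output path (placeholder)
            }
        }
    }

If the complex Mapper/Reducer classes must be reused as-is, one pragmatic option is to move their core logic into plain static helper methods that both the Hadoop job and the Spark functions call.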

MongoDB query to remove duplicate documents from a collection

给你一囗甜甜゛ submitted on 2019-12-12 01:22:55

Question: I take data from a search box and then insert it into MongoDB as a document using a regular insert query. The data is stored in a collection; for the word "cancer" it looks like the following, with a unique "_id":

    { "_id": { "$oid": "553862fa49aa20a608ee2b7b" }, "0": "c", "1": "a", "2": "n", "3": "c", "4": "e", "5": "r" }

Each document holds a single word stored in the same format as above, and I have many such documents. Now I want to remove the duplicate documents from the collection. I am unable to
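The question text is truncated, but one straightforward approach, shown here only as a sketch using the MongoDB Java sync driver, is to build a key from every field except _id and delete any document whose key has already been seen. The connection string, database, and collection names are hypothetical.

    import java.util.HashSet;
    import java.util.Set;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;
    import static com.mongodb.client.model.Filters.eq;

    public class RemoveDuplicateWords {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> words =
                        client.getDatabase("test").getCollection("words"); // hypothetical names

                Set<String> seen = new HashSet<>();
                for (Document doc : words.find()) {
                    Document copy = new Document(doc);   // Document is a Map, so this copies the fields
                    Object id = copy.remove("_id");      // compare everything except the unique _id
                    String key = copy.toJson();          // e.g. {"0": "c", "1": "a", ...} for "cancer"
                    if (!seen.add(key)) {
                        words.deleteOne(eq("_id", id));  // duplicate of an earlier document
                    }
                }
            }
        }
    }

This client-side loop is fine for small collections; for very large ones a server-side aggregation that groups on the word fields would avoid pulling every document to the client.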

How to use scope variables as property names in a Mongo Map/Reduce emit

不想你离开。 submitted on 2019-12-12 01:17:35

Question: There is a question (and answer) that deals with the general case. I am having difficulty using a scope variable as a field key (as opposed to the field value). In the example below, all the FULLY_CAPS fields are scope variables. In the case of SERVICE and IDENTIFIER, the emit correctly uses the value of the scope variable as it is passed to the M/R. However, when I try to use the value of a scope variable as a key in the emitted document, the document is created with the scope variable name (as

How to use a MapReduce output in Distributed Cache

十年热恋 submitted on 2019-12-12 00:33:57

Question: Let's say I have a MapReduce job which creates an output file part-00000, and there is one more job that runs after this job completes. How can I use the output file of the first job in the distributed cache of the second job?

Answer 1: The steps below might help you. Pass the first job's output directory path to the second job's driver class, and use a PathFilter to list the files starting with part-*. Refer to the code snippet below for your second job's driver class:

    FileSystem fs = FileSystem.get
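The answer's snippet is cut off after FileSystem.get, so here is a hedged sketch of the idea it describes for the second job's driver: list the part-* files under the first job's output directory with a PathFilter and register each one in the distributed cache. The directory path and file prefix come from the answer; everything else (class name, job name) is illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;
    import org.apache.hadoop.mapreduce.Job;

    public class SecondJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "second-job");

            Path firstJobOutput = new Path(args[0]);   // first job's output directory, passed in

            FileSystem fs = FileSystem.get(conf);
            PathFilter partFilter = path -> path.getName().startsWith("part-");
            FileStatus[] parts = fs.listStatus(firstJobOutput, partFilter);

            // Register every part-* file so the second job's tasks can read it locally
            for (FileStatus status : parts) {
                job.addCacheFile(status.getPath().toUri());
            }

            // ... set mapper/reducer classes and input/output paths, then job.waitForCompletion(true)
        }
    }

Inside the second job's mapper or reducer, the cached files can then be retrieved via context.getCacheFiles().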

MapReduce job taking too long to complete

南楼画角 submitted on 2019-12-12 00:11:47

Question: We have written a MapReduce job to process log files. We currently have around 52 GB of input files, but it takes around an hour to process the data. By default the job creates only one reducer task. We often see a timeout error in the reduce task; it then restarts and eventually completes. Below are the stats for a successful completion of the job. Kindly let us know how the performance can be improved.

    File System Counters
    FILE: Number of bytes read=876100387
    FILE: Number of bytes written
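The counters are truncated, but the single default reducer mentioned in the question is usually the first thing to change for 52 GB of input. A hedged sketch of the relevant driver settings follows; the reducer count and combiner class are illustrative, not tuned values.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class LogProcessingDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "log-processing");

            // Spread the reduce work over several tasks instead of the single default reducer.
            job.setNumReduceTasks(16);                 // illustrative; size to the cluster and data volume

            // If the reduce logic is associative and commutative (counting, summing),
            // a combiner sharply cuts the data shuffled to the reducers:
            // job.setCombinerClass(LogAggregationReducer.class);   // hypothetical class name

            // ... mapper/reducer classes, input/output paths, then job.waitForCompletion(true)
        }
    }

The reduce-side timeout described in the question is often a symptom of one overloaded reducer rather than something to paper over by raising mapreduce.task.timeout.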

Hadoop: sort the key and change the key value

≯℡__Kan透↙ submitted on 2019-12-12 00:03:03

Question: In Hadoop, the mapper receives the key as the position in the file, like "0, 23, 45, 76, 123", which I think are byte offsets. I have two large input files, and I need to split them in a manner where the same regions of the files (in terms of number of lines, e.g. 400 lines) get the same key. Byte offsets are clearly not the best option for that. I was wondering whether there is a way or option to change the keys to integers, so the output keys are "1, 2, 3, 4, 5" instead of "0, 23, 45, 76, 123"?
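One simple way to get region-based integer keys, shown here only as a sketch, is to ignore the byte-offset key and derive the key from a per-mapper line counter. The loud assumption: this only behaves as intended when each input file is read by a single mapper (for example, when the file is compressed with a non-splittable codec or isSplitable() is overridden to return false); otherwise every split restarts the counter.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class RegionKeyMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

        private static final int LINES_PER_REGION = 400;  // region size from the question
        private long lineNumber = 0;                       // lines seen by this mapper so far
        private final IntWritable regionKey = new IntWritable();

        @Override
        protected void map(LongWritable byteOffset, Text line, Context context)
                throws IOException, InterruptedException {
            // Replace the byte-offset key with a 1-based region number:
            // lines 0-399 -> 1, lines 400-799 -> 2, and so on.
            regionKey.set((int) (lineNumber / LINES_PER_REGION) + 1);
            lineNumber++;
            context.write(regionKey, line);
        }
    }

For splittable files, NLineInputFormat with setNumLinesPerSplit(job, 400) guarantees 400-line splits, but turning each split into a global sequential region number then needs extra bookkeeping, for example keying on the split's start offset and renumbering afterwards.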

How to calculate Centered Moving Average of a set of data in Hadoop Map-Reduce?

旧城冷巷雨未停 submitted on 2019-12-11 23:22:30

Question: I want to calculate the centered moving average of a set of data.

Example input format:

    quarter | sales
    Q1'11   | 9
    Q2'11   | 8
    Q3'11   | 9
    Q4'11   | 12
    Q1'12   | 9
    Q2'12   | 12
    Q3'12   | 9
    Q4'12   | 10

Mathematical representation of the data, with the moving average and then the centered moving average:

    Period | Value | MA     | Centered
    1      | 9     |        |
    1.5    |       |        |
    2      | 8     |        |
    2.5    |       | 9.5    |
    3      | 9     |        | 9.5
    3.5    |       | 9.5    |
    4      | 12    |        | 10.0
    4.5    |       | 10.5   |
    5      | 9     |        | 10.750
    5.5    |       | 11.0   |
    6      | 12    |        |
    6.5    |       |        |
    7      | 9     |        |

I am stuck with the implementation of the RecordReader, which will provide the mapper the sales value of a
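Before worrying about the RecordReader, the arithmetic itself is easy to pin down. Below is a plain-Java sketch of a 4-period moving average followed by the centering step (averaging adjacent moving averages); it is independent of any Hadoop code and simply uses the sample sales values from the question.

    public class CenteredMovingAverage {
        public static void main(String[] args) {
            double[] sales = {9, 8, 9, 12, 9, 12, 9, 10};   // Q1'11 .. Q4'12 from the question
            int window = 4;

            // Moving averages sit "between" periods: ma[i] covers sales[i] .. sales[i + window - 1]
            double[] ma = new double[sales.length - window + 1];
            for (int i = 0; i < ma.length; i++) {
                double sum = 0;
                for (int j = 0; j < window; j++) {
                    sum += sales[i + j];
                }
                ma[i] = sum / window;
            }

            // Centering: average each pair of adjacent moving averages so the result
            // lines up with an actual period rather than a half-period.
            for (int i = 0; i + 1 < ma.length; i++) {
                double centered = (ma[i] + ma[i + 1]) / 2;
                System.out.printf("period %d: centered MA = %.3f%n", i + window / 2 + 1, centered);
            }
        }
    }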

Reproducing MongoDB's map/emit functionality in javascript/node.js (without MongoDB)

那年仲夏 submitted on 2019-12-11 23:14:45

Question: I like the functionality that MongoDB provides for map/reduce tasks, specifically the emit() in the mapper function. How can I reproduce the map behavior shown below in javascript/node.js without MongoDB? Example (from the MongoDB Map-Reduce docs):

    [{ cust_id: "A123", amount: 500 }, { cust_id: "A123", amount: 250 }, { cust_id: "B212", amount: 200 }]

Mapped to:

    [{ "A123": [500, 200] }, { "B212": 200 }]

A library that makes it as simple as Mongo's one-line emit() would be nice, but native

XML parsing in Hadoop mapreduce

白昼怎懂夜的黑 submitted on 2019-12-11 21:32:03

Question: I have written MapReduce code for parsing XML into CSV, but I don't find any output in my output directory after running the job. I am not sure whether the file is not being read or not being written. I am new to Hadoop MapReduce. Can you please help with this? This is my entire code:

    public class XmlParser11 {
        public static String outvalue;
        public static class XmlInputFormat1 extends TextInputFormat {
            public static final String START_TAG_KEY = "xmlinput.start";
            public static final String END_TAG_KEY = "xmlinput
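The pasted code is truncated, so only the XmlInputFormat1 constants are visible. As a hedged illustration of how such a custom input format is typically wired up, the driver must set the start/end tag properties on the Configuration before the Job is created; the tag values below are hypothetical placeholders, not the asker's actual tags.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class XmlParserDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The custom input format reads these keys to know where an XML record starts and ends.
            conf.set("xmlinput.start", "<record>");   // hypothetical start tag
            conf.set("xmlinput.end", "</record>");    // hypothetical end tag

            Job job = Job.getInstance(conf, "xml-to-csv");
            job.setJarByClass(XmlParserDriver.class);
            job.setInputFormatClass(XmlParser11.XmlInputFormat1.class);  // the class from the question

            // ... set mapper, output key/value classes as in the question's code

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));      // must not already exist

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

A common cause of the empty-output symptom is setting these properties after Job.getInstance(...) (the job copies the Configuration when it is created) or a mapper that never reaches context.write(); checking the "Map output records" counter in the job history helps distinguish the two.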

Getting exception while trying to execute a Pig Latin Script

Deadly submitted on 2019-12-11 20:47:04

Question: I am learning Pig on my own, and while trying to explore a dataset I am encountering an exception. What is wrong in the script, and why?

    movies_data = LOAD '/movies_data' using PigStorage(',') as (id:chararray,title:chararray,year:int,rating:double,duration:double);
    high = FILTER movies_data by rating > 4.0;
    high_rated = FOREACH high GENERATE movies_data.title,movies_data.year,movies_data.rating,movies_data.duration;
    DUMP high_rated;

At the end of the MapReduce execution I am getting the below