Extracting rows containing a specific value using MapReduce and Hadoop

Question

I'm new to writing MapReduce functions. Suppose I have a CSV file containing four columns of data.

For example:

101,87,65,67  
102,43,45,40  
103,23,56,34  
104,65,55,40  
105,87,96,40

Now I want to extract, say:

40 102  
40 104  
40 105

since those rows contain 40 in the fourth column.

How do I write the map and reduce functions for this?

Answer 1:

The WordCount example closely resembles what you are trying to achieve. Instead of emitting a count for each word, add a condition that checks whether the tokenized String contains the required value, and write to the context only in that case. This works because the Mapper receives each line of the CSV separately.

The Reducer then receives the list of values, already grouped by key. In the Reducer, instead of using IntWritable as the output value type, you can use NullWritable, so your code outputs only the keys. You also do not need the loop in the Reducer, since you only want to output the keys.

I am not providing any code in my answer, since you would learn nothing from that. Make your way from the recommendations.

EDIT: since you modified your question to ask about the Reducer, here are some tips on how you can achieve what you want.

One way to achieve the desired result: in the Mapper, after splitting (or tokenizing) the line, write column 3 to the context as the key and column 0 as the value. Your Reducer, since you do not need any kind of aggregation, can simply write out the keys and values produced by the Mappers (yes, your Reducer code will end up being a single line). You can check one of my previous answers; the figure there explains quite well what the Map and Reduce phases are doing.
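The flow described in the EDIT can be tried out without a cluster. What follows is only a rough plain-Java sketch of it, not a Hadoop job: the class name `ColumnMatchSketch`, its methods, and the `TreeMap` standing in for the shuffle are all hypothetical names chosen for illustration. In an actual job the same logic would live in subclasses of Hadoop's `Mapper` and `Reducer` and emit pairs through `context.write`. The sketch assumes the target value is passed in as a plain string and that lines with fewer than four columns are skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the map -> shuffle -> reduce flow; no Hadoop classes are used.
class ColumnMatchSketch {

    // Map step: called once per CSV line, the way a Mapper is.
    // Emits (column 3, column 0) only when column 3 equals the target value.
    static void map(String line, String target, Map<String, List<String>> shuffle) {
        String[] cols = line.trim().split(",");
        if (cols.length < 4) {
            return; // assumption: malformed lines are silently skipped
        }
        String key = cols[3].trim();
        if (key.equals(target)) {
            // Grouping by key here imitates what the shuffle does between the phases.
            shuffle.computeIfAbsent(key, k -> new ArrayList<>()).add(cols[0].trim());
        }
    }

    // Reduce step: no aggregation, every (key, value) pair is passed through unchanged.
    static void reduce(String key, List<String> values, List<String> out) {
        for (String value : values) {
            out.add(key + " " + value);
        }
    }

    // Drives both phases over an in-memory list of lines.
    static List<String> run(List<String> lines, String target) {
        Map<String, List<String>> shuffle = new TreeMap<>(); // keys arrive sorted, as in Hadoop
        for (String line : lines) {
            map(line, target, shuffle);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : shuffle.entrySet()) {
            reduce(entry.getKey(), entry.getValue(), out);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "101,87,65,67", "102,43,45,40", "103,23,56,34",
                "104,65,55,40", "105,87,96,40");
        // Prints the three lines asked for in the question: 40 102, 40 104, 40 105
        for (String row : run(input, "40")) {
            System.out.println(row);
        }
    }
}
```

In this sketch the values for key 40 come out in input order because everything runs in one thread; a real job gives no such guarantee about the order of values within a key unless you arrange for it.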

Source: https://stackoverflow.com/questions/37004413/extracting-rows-containing-specific-value-using-mapreduce-and-hadoop

Tags

Hadoop

MapReduce

feature-extraction