Question
I am trying to create a variation of the word count Hadoop program in which it reads multiple files in a directory and outputs the frequency of each word. The thing is, I want it to output each word followed by the file name it came from and the frequency within that file. For example:
word1
( file1, 10)
( file2, 3)
( file3, 20)
So for word1 (say the word "and"), it finds it 10 times in file1, 3 times in file2, etc. Right now it is outputting only a key-value pair:
StringTokenizer itr = new StringTokenizer(chapter);
while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    context.write(word, one);
}
I can get the file name by
String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
But I do not understand how to format the output the way I want. I've been looking into OutputCollector, but I am unsure of how to use it exactly.
EDIT: These are my mapper and reducer:
public static class TokenizerMapper
        extends Mapper<Object, Text, Text, Text> {

    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Take out all non-letters and make everything lowercase
        String chapter = value.toString();
        chapter = chapter.toLowerCase();
        chapter = chapter.replaceAll("[^a-z]", " ");
        // This is the file name
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        StringTokenizer itr = new StringTokenizer(chapter);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, new Text(fileName));
        }
    }
}
public static class IntSumReducer
        extends Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Map<String, Integer> files = new HashMap<String, Integer>();
        for (Text val : values) {
            if (files.containsKey(val.toString())) {
                files.put(val.toString(), files.get(val.toString()) + 1);
            } else {
                files.put(val.toString(), 1);
            }
        }
        String outputString = "";
        for (String file : files.keySet()) {
            outputString = outputString + "\n<" + file + ", " + files.get(file) + ">";
        }
        context.write(key, new Text(outputString));
    }
}
This outputs, for the word "a" for example:
a
(
(chap02, 53), 1)
(
(chap18, 50), 1)
I am unsure of why it is making a key-value pair the key, with a value of 1 for each entry.
Answer 1:
I don't think you need a custom output format at all for this. So long as you pass the filename along to the reducer, you should be able to do this simply by modifying the String that you use within a TextOutputFormat type operation. Explanation is below.
In the mapper, get the file name and emit it as the value for each word, as below:
String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
context.write(word, new Text(fileName)); // word (the current token), not the input key
Then in the reducer do something like the following:
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    Map<String, Integer> files = new HashMap<String, Integer>();
    for (Text val : values) {
        if (files.containsKey(val.toString())) {
            files.put(val.toString(), files.get(val.toString()) + 1);
        } else {
            files.put(val.toString(), 1);
        }
    }
    // The key itself is written by context.write below, so the value only
    // needs the per-file entries.
    String outputString = "";
    for (String file : files.keySet()) {
        outputString += "\n( " + file + ", " + files.get(file) + ")";
    }
    context.write(key, new Text(outputString));
}
This reducer prefixes each (file, count) entry with "\n", which forces the display formatting to be exactly what you want.
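The counting and string-building logic in that reducer can be sanity-checked outside Hadoop. Below is a minimal plain-Java sketch; the `format` helper, class name, and sample inputs are hypothetical, chosen only to mirror what the reducer does for one word:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class PerFileCount {
    // Count how often each file name appears among the values for one word,
    // then build the same "\n( file, count)" string the reducer emits.
    static String format(String word, Iterable<String> values) {
        Map<String, Integer> files = new LinkedHashMap<>();
        for (String file : values) {
            files.merge(file, 1, Integer::sum);
        }
        StringBuilder out = new StringBuilder(word);
        for (Map.Entry<String, Integer> e : files.entrySet()) {
            out.append("\n( ").append(e.getKey()).append(", ").append(e.getValue()).append(")");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Simulate the word "and" arriving with one value per occurrence.
        System.out.println(format("and", Arrays.asList("file1", "file2", "file1")));
    }
}
```

A `LinkedHashMap` is used here only so the per-file lines come out in a predictable order; the reducer's `HashMap` gives the same counts in an arbitrary order.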
This seems much simpler than writing your own outputformat.
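As an aside, nested pairs like "( (chap02, 53), 1)" in the question typically appear when a `job.setCombinerClass(...)` line is left over from the stock word-count driver: the combiner runs the reducer on map output first, and its formatted strings then become the values the real reducer counts. A job-configuration sketch with no combiner (the driver class name and argument paths are assumptions, not from the original post):

```java
// Job wiring sketch, assuming the Hadoop org.apache.hadoop.mapreduce API.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "per-file word count");
job.setJarByClass(WordCountByFile.class); // hypothetical driver class
job.setMapperClass(TokenizerMapper.class);
// No job.setCombinerClass(...): running this reducer as a combiner would
// make the reducer count formatted strings instead of file names.
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
```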
Source: https://stackoverflow.com/questions/29612503/how-to-create-a-custom-output-format-in-hadoop