Question
I am trying to create a variation of the word count Hadoop program in which it reads multiple files in a directory and outputs the frequency of each word. The thing is, I want it to output each word followed by the file name it came from and the frequency within that file. For example:
word1
( file1, 10)
( file2, 3)
( file3, 20)
So for word1 (say the word "and"), it finds it 10 times in file1, 3 times in file2, etc. Right now it is outputting only a key-value pair:
StringTokenizer itr = new StringTokenizer(chapter);
while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    context.write(word, one);
}
I can get the file name by
String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
But I do not understand how to format the output the way I want. I've been looking into OutputCollector, but I am unsure of how to use it exactly.
EDIT: These are my mapper and reducer:
public static class TokenizerMapper
        extends Mapper<Object, Text, Text, Text> {

    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Take out all non-letters and make everything lowercase
        String chapter = value.toString();
        chapter = chapter.toLowerCase();
        chapter = chapter.replaceAll("[^a-z]", " ");
        // This is the file name
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        StringTokenizer itr = new StringTokenizer(chapter);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, new Text(fileName));
        }
    }
}
public static class IntSumReducer
        extends Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Map<String, Integer> files = new HashMap<String, Integer>();
        for (Text val : values) {
            if (files.containsKey(val.toString())) {
                files.put(val.toString(), files.get(val.toString()) + 1);
            } else {
                files.put(val.toString(), 1);
            }
        }
        String outputString = "";
        for (String file : files.keySet()) {
            outputString = outputString + "\n<" + file + ", " + files.get(file) + ">";
        }
        context.write(key, new Text(outputString));
    }
}
This outputs, for the word "a" for example:
a
(
(chap02, 53), 1)
(
(chap18, 50), 1)
I am unsure of why it is making a key-value pair the key, with a value of 1 for each entry.
Answer 1:
I don't think you need a custom output format at all for this. So long as you pass the filename along to the reducer, you should be able to do this simply by modifying the String that you use within a TextOutputFormat type operation. Explanation is below.
In the mapper, get the file name and emit it as the value for each word, as below:
String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
context.write(word, new Text(fileName)); // word (the current token), not the input key
Then in the reducer do something like the following:
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    Map<String, Integer> files = new HashMap<String, Integer>();
    for (Text val : values) {
        if (files.containsKey(val.toString())) {
            files.put(val.toString(), files.get(val.toString()) + 1);
        } else {
            files.put(val.toString(), 1);
        }
    }
    // The key itself is written by context.write below, so the value only
    // needs the per-file entries.
    String outputString = "";
    for (String file : files.keySet()) {
        outputString += "\n( " + file + ", " + files.get(file) + ")";
    }
    context.write(key, new Text(outputString));
}
This reducer prefixes each (file, count) entry with "\n", which forces the display formatting to be exactly what you want.
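The counting and string-building logic in that reducer can be sanity-checked outside Hadoop. Below is a minimal plain-Java sketch; the `format` helper, class name, and sample inputs are hypothetical, chosen only to mirror what the reducer does for one word:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class PerFileCount {
    // Count how often each file name appears among the values for one word,
    // then build the same "\n( file, count)" string the reducer emits.
    static String format(String word, Iterable<String> values) {
        Map<String, Integer> files = new LinkedHashMap<>();
        for (String file : values) {
            files.merge(file, 1, Integer::sum);
        }
        StringBuilder out = new StringBuilder(word);
        for (Map.Entry<String, Integer> e : files.entrySet()) {
            out.append("\n( ").append(e.getKey()).append(", ").append(e.getValue()).append(")");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Simulate the word "and" arriving with one value per occurrence.
        System.out.println(format("and", Arrays.asList("file1", "file2", "file1")));
    }
}
```

A `LinkedHashMap` is used here only so the per-file lines come out in a predictable order; the reducer's `HashMap` gives the same counts in an arbitrary order.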
This seems much simpler than writing your own outputformat.
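As an aside, nested pairs like "( (chap02, 53), 1)" in the question typically appear when a `job.setCombinerClass(...)` line is left over from the stock word-count driver: the combiner runs the reducer on map output first, and its formatted strings then become the values the real reducer counts. A job-configuration sketch with no combiner (the driver class name and argument paths are assumptions, not from the original post):

```java
// Job wiring sketch, assuming the Hadoop org.apache.hadoop.mapreduce API.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "per-file word count");
job.setJarByClass(WordCountByFile.class); // hypothetical driver class
job.setMapperClass(TokenizerMapper.class);
// No job.setCombinerClass(...): running this reducer as a combiner would
// make the reducer count formatted strings instead of file names.
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
```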
Source: https://stackoverflow.com/questions/29612503/how-to-create-a-custom-output-format-in-hadoop