Question
I am having an issue getting a breakdown of the total number of occurrences of words per file. For example, I have four text files (t1, t2, t3, t4). Word w1 appears twice in file t2 and once in t4, for a total of three occurrences. I want to write that same information to the output file. I am getting the total number of words in each file, but I can't get the result I described above.
Here is my mapper class.
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
// line added
import org.apache.hadoop.mapreduce.lib.input.*;

public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    private String pattern = "^[a-z][a-z0-9]*$";

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        // line added: get the name of the file this split came from
        InputSplit inputSplit = context.getInputSplit();
        String fileName = ((FileSplit) inputSplit).getPath().getName();
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            String stringWord = word.toString().toLowerCase();
            if (stringWord.matches(pattern)) {
                context.write(new Text(stringWord), one);
                context.write(new Text(fileName), one);
            }
        }
    }
}
Answer 1:
You can achieve this by writing the word as the key and the filename as the value. In your reducer, initialize a separate counter for each file and update them as you iterate. Once all the values for a particular key have been iterated, write the counter for each file to the context.
Since you know you have only four files, you can hard-code four variables. Remember, you need to reset the variables for each new key you process in the reducer.
If the number of files is larger, you can use a Map instead: the filename will be the key, and you keep updating its value.
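The per-key counting described above can be sketched in plain Java, with no Hadoop dependency, so the logic is easy to test. The class and method names here are illustrative, and it assumes the mapper emitted (word, filename) pairs, so the reducer sees one word as the key and all the filenames it occurred in as the values:

```java
import java.util.*;

public class PerFileCounts {
    // Given every filename emitted for one word, count occurrences per file.
    // This mirrors what the reducer's Map-based counting would do for each key.
    static Map<String, Integer> countPerFile(Iterable<String> fileNames) {
        Map<String, Integer> counts = new HashMap<>();
        for (String f : fileNames) {
            // start at 1 on first sight of the file, otherwise increment
            counts.merge(f, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // word w1 appeared twice in t2 and once in t4
        Map<String, Integer> c = countPerFile(Arrays.asList("t2", "t4", "t2"));
        System.out.println("w1 " + c);
    }
}
```

Because a fresh map is built per call (i.e., per key), there is nothing to reset between words, which is the main advantage over four hard-coded variables.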
Answer 2:
In the output of the mapper, we can set the text file name as the key and each row of the file as the value. This reducer then gives you the file name along with each word and its corresponding count.
import java.io.IOException;
import java.util.HashMap;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Reduce extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // key is the file name; each value is one row from that file.
        // The map is local to this call so counts are not carried over
        // from one file to the next.
        HashMap<String, Integer> input = new HashMap<String, Integer>();
        for (Text val : values) {
            String row = val.toString();           // processing each row
            String[] wordarray = row.split(" ");   // assuming the delimiter is a space
            for (int i = 0; i < wordarray.length; i++) {
                if (input.get(wordarray[i]) == null) {
                    input.put(wordarray[i], 1);
                } else {
                    input.put(wordarray[i], input.get(wordarray[i]) + 1);
                }
            }
        }
        // write once per file, after all of its rows have been counted
        context.write(new Text(key), new Text(input.toString()));
    }
}
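The grouping-and-counting step of this approach can be simulated in plain Java without a Hadoop cluster. This is a sketch with an illustrative class name, where the list of rows stands in for the values the reducer would receive for one file-name key:

```java
import java.util.*;

public class SimulateAnswer2 {
    // Mimics one reduce() call: count words across all rows of a single file.
    static Map<String, Integer> wordCounts(List<String> rows) {
        Map<String, Integer> input = new HashMap<>();
        for (String row : rows) {
            for (String w : row.split(" ")) {   // space-delimited, as above
                input.merge(w, 1, Integer::sum);
            }
        }
        return input;
    }

    public static void main(String[] args) {
        // suppose file t2 contains the rows "w1 w2" and "w1"
        Map<String, Integer> counts = wordCounts(Arrays.asList("w1 w2", "w1"));
        System.out.println("t2 " + counts);
    }
}
```

Note that `String.split` takes a regex string, not a char, which is why the original `split(' ')` would not compile.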
Source: https://stackoverflow.com/questions/32969870/wordcount-example-with-count-per-file