hadoop inverted-index without recurrence of file names

Submitted by 亡梦爱人 on 2019-12-22 01:08:23

Question


What I currently have in the output is:

word     file
-----    ------
wordx    Doc2, Doc1, Doc1, Doc1, Doc1, Doc1, Doc1, Doc1

What I want is:

word     file
-----    ------
wordx    Doc2, Doc1

public static class LineIndexMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private final static Text word = new Text();
    private final static Text location = new Text();

    public void map(LongWritable key, Text val,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
        String fileName = fileSplit.getPath().getName();
        location.set(fileName);

        String line = val.toString();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, location);
        }
    }
}

public static class LineIndexReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {

        boolean first = true;
        StringBuilder toReturn = new StringBuilder();
        while (values.hasNext()) {
            if (!first) {
                toReturn.append(", ");
            }
            first = false;
            toReturn.append(values.next().toString());
        }

        output.collect(key, new Text(toReturn.toString()));
    }
}

For the best performance, where should I skip the recurring file name: in the map, in the reduce, or in both? P.S. I am a beginner at writing MR tasks and am also trying to figure out the programming logic with my question.


Answer 1:


You will only be able to remove duplicates in the Reducer. To do so, you can use a Set, which does not allow duplicates.

public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {

    // Text overrides equals() and hashCode(), so HashSet de-duplicates correctly
    Set<Text> outputValues = new HashSet<Text>();

    while (values.hasNext()) {
      // make a new Object because Hadoop may mess with original
      Text value = new Text(values.next());

      // takes care of removing duplicates
      outputValues.add(value);
    }

    boolean first = true;
    StringBuilder toReturn = new StringBuilder();
    Iterator<Text> outputIter = outputValues.iterator();
    while (outputIter.hasNext()) {
        if (!first) {
            toReturn.append(", ");
        }
        first = false;
        toReturn.append(outputIter.next().toString());
    }

    output.collect(key, new Text(toReturn.toString()));
}

Edit: Adds copy of value to Set as per Chris' comment.




Answer 2:


You can improve performance by doing local map aggregation and introducing a combiner - basically, you want to reduce the amount of data being transmitted between your mappers and reducers.

Local map aggregation is a concept whereby you maintain an LRU-like map (or set) of output pairs - in your case, a set of the words seen in the current mapper's document (assuming you have a single document per map). This way you can look each word up in the set and only output a K,V pair when the set doesn't already contain that word (indicating you haven't already output an entry for it). If the set doesn't contain the word, output the (word, docid) pair and add the word to the set.

If the set gets too big (say 5,000 or 10,000 entries), clear it out and start over. This way you'll dramatically reduce the number of values output from the mapper (if your value domain, or set of values, is small - words are a good example of this).
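
As a rough sketch of that idea applied to the mapper from the question (the class name, the set name, and the flush threshold here are assumptions of mine, not part of the original code; it also relies on each FileSplit covering a single file, so the location stays constant for the whole task):

// in addition to the imports the question's code already uses,
// this needs java.util.Set and java.util.HashSet
public static class DedupLineIndexMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    // assumed flush threshold - tune it to your heap size
    private static final int MAX_CACHE_SIZE = 10000;

    private final Text word = new Text();
    private final Text location = new Text();
    // words already emitted by this task; safe because a FileSplit covers
    // a single file, so the location never changes within one task
    private final Set<String> seenWords = new HashSet<String>();

    public void map(LongWritable key, Text val,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
        location.set(fileSplit.getPath().getName());

        StringTokenizer itr = new StringTokenizer(val.toString().toLowerCase());
        while (itr.hasMoreTokens()) {
            String token = itr.nextToken();
            // Set.add() returns false when the word was already emitted
            if (seenWords.add(token)) {
                word.set(token);
                output.collect(word, location);
            }
        }
        // cap memory use: clear the cache once it passes the threshold;
        // anything emitted twice after a flush is de-duplicated in the reducer
        if (seenWords.size() > MAX_CACHE_SIZE) {
            seenWords.clear();
        }
    }
}

Duplicates can still slip through after a flush, and across separate map tasks reading the same file, so the Set-based de-duplication in the reducer (see Answer 1) is still needed for a fully correct index.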

You can also apply your reducer logic in the combiner phase.
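
Note that the reducer from Answer 1 can't be registered as the combiner unchanged: it joins all locations into a single comma-separated Text, which the real reducer would then treat as one opaque value. A combiner for this job only needs to drop duplicates while still emitting one record per distinct location. A minimal sketch, with a class name (LineIndexCombiner) of my choosing:

// same Reducer imports as the question's code, plus java.util.Set and java.util.HashSet
public static class LineIndexCombiner extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        Set<Text> seen = new HashSet<Text>();
        while (values.hasNext()) {
            // copy before storing - Hadoop re-uses the value object it hands you
            Text value = new Text(values.next());
            if (seen.add(value)) {
                // emit one record per distinct location; the reducer
                // still does the final join into a comma-separated list
                output.collect(key, value);
            }
        }
    }
}

Register it in the driver with conf.setCombinerClass(LineIndexCombiner.class). Because the combiner's output types match its input types (Text, Text), Hadoop is free to run it zero or more times without changing the final result.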

One final word of warning: be very careful about adding the Key/Value objects into sets (as in Matt D's answer). Hadoop re-uses objects under the hood, so don't be surprised if you get unexpected results when you add in the references - always create a copy of the object.

There's an article on local map aggregation (for the word count example) that you may find useful:

  • http://wikidoop.com/wiki/Hadoop/MapReduce/Mapper#Map_Aggregation


Source: https://stackoverflow.com/questions/10305435/hadoop-inverted-index-without-recurrence-of-file-names
