How to process Header and Trailer in MapReduce

Question

How can the header and trailer lines of a file be processed? After these lines have been processed, they should be removed from the file.

The header line can be identified by its offset value of 0, and the trailer by the maximum offset. But the issue here is how to get both of these lines in one mapper.

I'd appreciate your help.

Regards, Mohammed Niaz

Answer 1:

It is possible when we have only one mapper for the given input file.

We can process the header and trailer records using any of the following three options:

1. Write a custom InputFormat class that extends FileInputFormat. In the custom InputFormat, override the isSplitable() method and return false, so that the MR framework won't split the file and will pass its whole content to one mapper.
2. Make the HDFS block size greater than the file size (not recommended), so that the whole file content is available to one mapper.
3. Gzip the input file. Gzip-compressed files are not splittable, so the whole file content is available to one mapper.
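As a rough illustration of the first option, here is a minimal sketch, assuming the newer org.apache.hadoop.mapreduce API and line-oriented text input. The class name NonSplittableTextInputFormat is made up for this example; it extends TextInputFormat (itself a FileInputFormat) so that the usual line reading is kept.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical class name, used only for this sketch.
public class NonSplittableTextInputFormat extends TextInputFormat {

    // Returning false tells the framework to treat each input file as a
    // single split, so one mapper receives every line of that file.
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
```

It would then be registered in the driver with job.setInputFormatClass(NonSplittableTextInputFormat.class). Note that this gives up parallelism within a file, which may matter for large inputs.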

Would welcome any comments or suggestions.

Answer 2:

While the TextInputFormat class (and others, but I don't have a complete list) in a MapReduce job can't give you the line number of the record that you're processing, it can give you the byte offset via the key. Apparently, it's used to produce unique keys. Because the offset is relative to the start of the file, only the first line has key 0. The following loop, which belongs in an overridden Mapper.run() method, skips the first record of the input (that is, the header record):

    while (context.nextKeyValue()) {
        // The key is the byte offset of the line; offset 0 is the header.
        if (context.getCurrentKey().get() != 0L) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        } else {
            System.out.println("Skipping the header: " + context.getCurrentValue());
        }
    }
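The loop above only handles the header. To also catch the trailer when a single mapper sees the whole file, one possible approach is to hold each record back by one step, so that the record still pending when the input ends must be the trailer. The sketch below shows that logic in plain Java, without Hadoop, so it can be run standalone. The class HeaderTrailerSketch, its Result type, and the sample records are all hypothetical, and it assumes every line ends with a single '\n'.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HeaderTrailerSketch {

    /** Holds the header, the trailer, and the data records in between. */
    public static final class Result {
        public String header;
        public String trailer;
        public final List<String> body = new ArrayList<>();
    }

    /**
     * Walks the lines in order, tracking byte offsets the way TextInputFormat
     * keys do, and delays each record by one step so that the last one can be
     * recognised as the trailer once the input is exhausted.
     */
    public static Result split(List<String> lines) {
        Result result = new Result();
        long offset = 0L;       // byte offset of the current line
        String pending = null;  // record held back until its successor is seen
        for (String line : lines) {
            if (offset == 0L) {
                result.header = line;         // offset 0 marks the header
            } else {
                if (pending != null) {
                    result.body.add(pending); // a successor exists, so not the trailer
                }
                pending = line;
            }
            // +1 accounts for the '\n' assumed to terminate every line
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        result.trailer = pending;             // last held-back record, or null
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample records, for illustration only.
        Result r = split(Arrays.asList("HDR|20140810", "a,1", "b,2", "TRL|2"));
        System.out.println("header=" + r.header + " trailer=" + r.trailer + " body=" + r.body);
    }
}
```

In a real job, the same idea would sit inside the overridden Mapper.run() loop: call map() on the pending record each time a new one arrives, and treat whatever is still pending after nextKeyValue() returns false as the trailer. This only works if the file is not split across mappers.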

Source: https://stackoverflow.com/questions/25226052/how-to-process-header-and-trailer-in-mapreduce

Tags

MapReduce

HDFS