How to process Header and Trailer in MapReduce

Submitted by 天大地大妈咪最大 on 2019-12-12 04:57:55

Question


How can I process the header and trailer lines of a file? After these lines have been processed, they should be removed from the file.

The header line can be identified by its offset value of 0, and the trailer by the maximum offset. The problem is how to get both of these lines in one mapper.

I'd appreciate your help.

Regards, Mohammed Niaz


Answer 1:


It is possible when only one mapper processes the given input file.

Any of the following three options makes the whole file, including the header and trailer records, go to a single mapper:

  1. Write a custom InputFormat that extends FileInputFormat and override its isSplitable() method to return false. The MR framework then won't split the file and passes the whole content to one mapper.
  2. Make the HDFS block size greater than the file size (not recommended), so that the whole file content is available to one mapper.
  3. Gzip the input file. Gzip is not a splittable compression format, so the whole file content goes to one mapper.
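
To see why each of the three options ends up with a single mapper, here is a simplified, plain-Java model of how the number of input splits (and therefore mappers) is derived. It is not Hadoop code: the class and method names are hypothetical, and it assumes the split size equals the block size (default min/max split settings) and a non-empty file.

```java
// Simplified model of split counting for one input file. Not Hadoop code;
// SplitCountModel and countSplits are hypothetical names for illustration.
final class SplitCountModel {
    // The last split is allowed to run up to 10% over the split size.
    private static final double SPLIT_SLOP = 1.1;

    // Assumption: split size == block size, and fileSize > 0.
    static int countSplits(long fileSize, long blockSize, boolean splittable) {
        if (fileSize <= 0 || blockSize <= 0) {
            throw new IllegalArgumentException("model covers non-empty files only");
        }
        if (!splittable) {
            return 1; // isSplitable() == false, or a non-splittable codec such as gzip
        }
        int splits = 0;
        long remaining = fileSize;
        while ((double) remaining / blockSize > SPLIT_SLOP) {
            splits++;
            remaining -= blockSize;
        }
        if (remaining != 0) {
            splits++; // final, possibly short, split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(countSplits(300 * mb, 128 * mb, true));  // splittable, small blocks
        System.out.println(countSplits(300 * mb, 128 * mb, false)); // option 1 or 3
        System.out.println(countSplits(300 * mb, 512 * mb, true));  // option 2
    }
}
```

In this model, options 1 and 3 correspond to splittable being false, and option 2 corresponds to a block size larger than the file.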

Would welcome any comments or suggestions.




Answer 2:


Although the TextInputFormat class (and others, but I don't have a complete list) can't give you the line number of the record you're processing in a MapReduce job, it can give you the byte offset through the key. Apparently, it's used to produce unique keys. The following code, which belongs in the mapper's overridden run() method, skips the first record of the input (i.e., the header record), which is the only record with offset 0:

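    // Goes inside your Mapper subclass (input key type LongWritable); needs java.io.IOException.
    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        while (context.nextKeyValue()) {
            if (context.getCurrentKey().get() != 0L) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            } else {
                System.out.println("Skipping the header: " + context.getCurrentValue());
            }
        }
        cleanup(context);
    }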
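
The loop above only deals with the header. The trailer is harder because a record does not know it is the last one until no further record arrives. One possible approach, sketched here in plain Java without any Hadoop dependencies, is to hold each record back until the next one shows up. The class HeaderTrailerScan and its methods are hypothetical; accept() plays the role of map() with the byte offset as key, and finish() plays the role of cleanup(). The sketch assumes a single reader sees the whole file and that lines end with "\n".

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: separates header, body and trailer with a one-record lookahead.
final class HeaderTrailerScan {
    String header;                        // record at byte offset 0
    String trailer;                       // last record seen, set in finish()
    final List<String> body = new ArrayList<>();

    private String pending;               // held back until we know it is not the last record

    // Called once per record, like map() with key = byte offset of the line.
    void accept(long offset, String line) {
        if (offset == 0L) {
            header = line;
            return;
        }
        if (pending != null) {
            body.add(pending);            // a newer record arrived, so pending was not the trailer
        }
        pending = line;
    }

    // Called once after the last record, like cleanup().
    void finish() {
        trailer = pending;                // stays null if the file had no line after the header
        pending = null;
    }

    // Feeds the lines of a file to accept() with the offsets a line reader would report.
    static HeaderTrailerScan scan(String fileContent) {
        HeaderTrailerScan s = new HeaderTrailerScan();
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            s.accept(offset, line);
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }
        s.finish();
        return s;
    }
}
```

For example, HeaderTrailerScan.scan("HDR|2\nrow1\nrow2\nTRL|2\n") leaves "HDR|2" in header, "TRL|2" in trailer, and the two rows in body. With more than one mapper per file, each mapper would treat the last record of its own split as a trailer, so this approach only makes sense when the file is not split.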


Source: https://stackoverflow.com/questions/25226052/how-to-process-header-and-trailer-in-mapreduce
