Get Line number in map method using FileInputFormat

生来就可爱ヽ(ⅴ<●) posted on 2020-01-02 10:19:37

Question


I was wondering whether it is possible to get the line number in my map method. My input file is just a single column of values, like:

Apple
Orange
Banana

Is it possible to get Key: 1, Value: Apple; Key: 2, Value: Orange; ... in my map method?

I am using CDH3/CDH4. Changing the input data so that I can use KeyValueTextInputFormat is not an option. Thanks in advance.


Answer 1:


By default, InputFormats such as TextInputFormat use the byte offset of each record as the key rather than the line number. The main reason is that the true line number cannot be determined when an input file is splittable and is processed by two or more mappers: a mapper that starts partway through the file has no way of knowing how many lines precede its split.
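To make the difference concrete, here is a minimal plain-Java sketch (no Hadoop dependency) that computes the byte offset at which each line of the sample input starts, which is the kind of key a byte-offset-based reader hands to map(). The class and method names are hypothetical and exist only for this illustration; it assumes UTF-8 text with '\n' line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Hadoop.
final class OffsetVersusLineNumber {

    /** Returns the byte offset at which each '\n'-terminated line starts. */
    static List<Long> byteOffsets(String content) {
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        List<Long> offsets = new ArrayList<>();
        if (bytes.length > 0) {
            offsets.add(0L); // the first line always starts at byte 0
        }
        // A new line starts right after every newline except a trailing one.
        for (int i = 0; i < bytes.length - 1; i++) {
            if (bytes[i] == '\n') {
                offsets.add((long) (i + 1));
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        String content = "Apple\nOrange\nBanana\n";
        List<Long> offsets = byteOffsets(content);
        String[] lines = content.split("\n");
        for (int i = 0; i < lines.length; i++) {
            // Prints offset=0, offset=6 and offset=13 for lines 1, 2 and 3.
            System.out.println("offset=" + offsets.get(i)
                    + " line=" + (i + 1) + " value=" + lines[i]);
        }
    }
}
```

A mapper whose split begins at byte 13 would see "Banana" with key 13, but it could only learn that this is line 3 by counting the newlines in bytes 0 to 12, which belong to another mapper's split.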

You could create your own InputFormat (based on TextInputFormat and its associated LineRecordReader) that produces line numbers rather than byte offsets, but it would have to return false from the isSplitable method (note Hadoop's spelling, with a single "t"), meaning that a large input file would not be processed by multiple mappers. If you have small files, or files that are close in size to the HDFS block size, this shouldn't be a problem. Also, files in non-splittable compression formats (GZip .gz, for example) are processed in their entirety by a single mapper anyway.
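The counting logic such a record reader needs is small. The following is a rough plain-Java sketch of it, not a working Hadoop class: the class name LineNumberReaderSketch is hypothetical, and it reads from a BufferedReader instead of an HDFS split so that it runs without Hadoop on the classpath. The method names mirror those of Hadoop's RecordReader (nextKeyValue, getCurrentKey, getCurrentValue); in a real job the same logic would presumably live in a RecordReader<LongWritable, Text> returned by a FileInputFormat subclass whose isSplitable returns false.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical sketch; a real implementation would extend Hadoop's RecordReader.
final class LineNumberReaderSketch implements AutoCloseable {
    private final BufferedReader in;
    private long lineNumber = 0; // becomes 1 after the first successful read
    private String value;

    LineNumberReaderSketch(BufferedReader in) {
        this.in = in;
    }

    /** Advances to the next line; returns false once the input is exhausted. */
    boolean nextKeyValue() throws IOException {
        String line = in.readLine();
        if (line == null) {
            value = null;
            return false;
        }
        lineNumber++; // only valid because one reader sees the whole file
        value = line;
        return true;
    }

    /** 1-based line number of the current record. */
    long getCurrentKey() {
        return lineNumber;
    }

    /** Text of the current line, without its line terminator. */
    String getCurrentValue() {
        return value;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    public static void main(String[] args) throws IOException {
        String content = "Apple\nOrange\nBanana\n";
        try (LineNumberReaderSketch reader =
                     new LineNumberReaderSketch(new BufferedReader(new StringReader(content)))) {
            while (reader.nextKeyValue()) {
                System.out.println("Key: " + reader.getCurrentKey()
                        + ", Value: " + reader.getCurrentValue());
            }
        }
    }
}
```

The counter is correct only because a single reader consumes the file from its first byte, which is exactly why the input format has to be non-splittable.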



Source: https://stackoverflow.com/questions/15543827/get-line-number-in-map-method-using-fileinputformat
