Question
I was wondering whether it is possible to get the line number in my map method? My input file is just a single column of values like,
Apple
Orange
Banana
Is it possible to get Key: 1, Value: Apple; Key: 2, Value: Orange; ... in my map method?
I am using CDH3/CDH4. Changing the input data so that it can be read with KeyValueTextInputFormat is not an option. Thanks in advance.
Answer 1:
The default behaviour of InputFormats such as TextInputFormat is to give the byte offset of the record rather than the actual line number. This is mainly because the true line number cannot be determined when an input file is splittable and is being processed by two or more mappers: a mapper that starts in the middle of the file has no way of knowing how many lines precede its split.
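To make the difference concrete, here is a small plain-Java sketch (no Hadoop involved) that computes the byte offset at which each line of the sample input starts. These offsets correspond to the keys TextInputFormat would pass to the map method. The class name ByteOffsetDemo and the helper lineOffsets are made up for this illustration, and the input is assumed to be ASCII with \n line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ByteOffsetDemo {

    /** Returns the byte offset at which each line starts (hypothetical helper). */
    public static List<Long> lineOffsets(byte[] data) {
        List<Long> offsets = new ArrayList<>();
        long start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                offsets.add(start);   // record where the line that just ended began
                start = i + 1;        // the next line begins after the newline
            }
        }
        if (start < data.length) {
            offsets.add(start);       // last line without a trailing newline
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] data = "Apple\nOrange\nBanana\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(lineOffsets(data)); // prints [0, 6, 13]
    }
}
```

A mapper whose split began at byte 6 would see the keys 6 and 13, but it could not tell that these are lines 2 and 3 without also reading everything before its split.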
You could create your own InputFormat (based upon TextInputFormat and its associated LineRecordReader) to produce line numbers rather than byte offsets, but you would need to configure your input format to return false from the isSplitable method (meaning that a large input file would not be processed by multiple mappers). If you have small files, or files that are close in size to the HDFS block size, then this shouldn't be a problem. Also, non-splittable compression formats (GZip .gz, for example) mean the entire file will be processed by a single mapper anyway.
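A minimal sketch of such an input format might look like the following. It assumes the newer org.apache.hadoop.mapreduce API. The class names LineNumberInputFormat and LineNumberRecordReader are invented for this example, and the code has not been tested against CDH3 or CDH4 specifically. Note that the Hadoop method is spelled isSplitable, with a single "t". The reader delegates the actual line parsing to the stock LineRecordReader and only replaces the key with a 1-based counter.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical input format: key = 1-based line number, value = line text.
public class LineNumberInputFormat extends FileInputFormat<LongWritable, Text> {

    // Never split a file, so one reader sees every line from the start.
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new LineNumberRecordReader();
    }

    // Wraps LineRecordReader and swaps the byte-offset key for a line counter.
    public static class LineNumberRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader delegate = new LineRecordReader();
        private final LongWritable lineNumber = new LongWritable(0);

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException {
            delegate.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (!delegate.nextKeyValue()) {
                return false;                      // end of file
            }
            lineNumber.set(lineNumber.get() + 1);  // count the line just read
            return true;
        }

        @Override
        public LongWritable getCurrentKey() {
            return lineNumber;
        }

        @Override
        public Text getCurrentValue() {
            return delegate.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException {
            return delegate.getProgress();
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }
    }
}
```

With a class like this on the job's classpath, it would be registered in the driver via job.setInputFormatClass(LineNumberInputFormat.class), and the mapper would then receive a LongWritable line number as its key instead of a byte offset. The counter restarts at 1 for each input file, because each file gets its own reader.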
Source: https://stackoverflow.com/questions/15543827/get-line-number-in-map-method-using-fileinputformat