Hadoop streaming with Python: Keeping track of line numbers
问题 I am trying to do what should be a simple task: I need to convert a text file to upper case using Hadoop streaming with Python. I want to do it by using the TextInputFormat which passes file position keys and text values to the mappers. The problem is that Hadoop streaming automatically discards the file position keys, which are needed to preserve the ordering of the document. How can I retain the file position information of the input to the mappers? Or is there a better way to convert a