Hadoop streaming mapper byte offset not being generated

Question

I'm running a Hadoop streaming job, and the byte offsets are not being passed through as the keys of the mapper output, as I would expect. The command:

$HADOOP_INSTALL/bin/hadoop \
jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-$HADOOP_VERSION.jar \
-D stream.map.input.ignoreKey=false \
-inputformat org.apache.hadoop.mapred.TextInputFormat \
-file ./mapper.py \
-file ./reducer.py \
-mapper ./mapper.py \
-reducer ./reducer.py \
-input $INPUT_DIR \
-output $OUTPUT_DIR \
-cmdenv REGEX=$REGEX

My understanding is that TextInputFormat is the default, so I also tried the above command without the -inputformat option. I've also tried removing the -D, but I'm told that this is required to get the byte offset as key when using the streaming API.

For what it's worth, I'm just experimenting with Hadoop for a student project. At the moment, the mapper is a very simple Python grep of a file in HDFS, matching each line against the supplied regex:

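import os
import re
import sys

# The regex is passed to the task via -cmdenv REGEX=...
pattern = re.compile(os.environ['REGEX'])
for line in sys.stdin:
    # Write every matching input line to stdout unchanged.
    if pattern.search(line):
        sys.stdout.write(line)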

Right now, though, the only thing output to the reducer is the matching lines themselves. I'm expecting tab- or whitespace-delimited key/value pairs, where key=byte_offset and value=regex_line_match.
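
To illustrate the output I'm expecting, here is a minimal sketch of the same grep written for the case where the key does reach the mapper. It assumes each record on stdin would then look like byte_offset<TAB>line; the function name grep_with_offsets is hypothetical, and a record without a tab is treated as a bare line with an empty key (so a keyless line that itself contains a tab would be misparsed).

```python
import re


def grep_with_offsets(lines, regex):
    """Yield 'key<TAB>text' for every record whose text matches regex.

    Assumption: each record arrives as 'key<TAB>value', where the key is
    the byte offset. Records without a tab get an empty key.
    """
    pattern = re.compile(regex)
    for record in lines:
        record = record.rstrip("\n")
        # Split on the first tab only, so tabs inside the line text survive.
        key, sep, text = record.partition("\t")
        if not sep:
            # No key was supplied; treat the whole record as the line text.
            key, text = "", record
        if pattern.search(text):
            yield "%s\t%s" % (key, text)
```

In the mapper script this would be driven with something like: for out in grep_with_offsets(sys.stdin, os.environ['REGEX']): print(out)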

Can anyone tell me or suggest why this is happening?

Also, I'm just as interested in answers to these two (related) questions:

Is it possible for a mapper to manually determine the byte offset for each line of the data it is processing relative to the file which the data belongs to?
Is it possible for a mapper to determine the total number of bytes in the file to which the data it is processing belongs?

If yes to either of these questions, how? (python, or streaming in general).
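
One possible direction, sketched under assumptions rather than confirmed on this setup: Hadoop streaming exports job configuration properties to the task as environment variables with the dots replaced by underscores, so the split's file, start, and length should be visible as map_input_file, map_input_start, and map_input_length (or with a mapreduce_ prefix on newer releases). The helpers split_info and with_offsets below are hypothetical names. The offsets are only exact when the split begins on a line boundary, which holds for a split starting at byte 0 (for example a small, uncompressed file read as a single split); the length reported is that of the split, which equals the file size only in that single-split case.

```python
def split_info(env):
    """Return (file, start, length) for the current input split, or None.

    Tries the older property names first, then the newer ones.
    """
    for prefix in ("map_input_", "mapreduce_map_input_"):
        if prefix + "file" in env:
            return (env[prefix + "file"],
                    int(env[prefix + "start"]),
                    int(env[prefix + "length"]))
    return None


def with_offsets(raw_lines, start):
    """Yield (byte_offset, raw_line) pairs, counting bytes from `start`.

    raw_lines must be bytes objects that still carry their line endings,
    otherwise the running count will not match the file.
    """
    offset = start
    for raw in raw_lines:
        yield offset, raw
        offset += len(raw)
```

In a Python 3 mapper this might be called as split_info(os.environ) and with_offsets(sys.stdin.buffer, start), so that lengths are counted in bytes rather than decoded characters.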

Edit:
If I use -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat then the byte offsets are produced as keys of the mapper output. But the job takes a really long time to complete (and my input file only has about 50 lines of text in it!).

Source: https://stackoverflow.com/questions/15600495/hadoop-streaming-mapper-byte-offset-not-being-generated

Tags

python

Hadoop

MapReduce

hadoop-streaming

mapper