Hadoop streaming mapper byte offset not being generated

Submitted by 假如想象 on 2019-12-01 08:27:52

Question


I'm running a streaming Hadoop job and the byte offsets are not being generated as output (keys) of the mapper, as I would expect. The command:

$HADOOP_INSTALL/bin/hadoop \
jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-$HADOOP_VERSION.jar \
-D stream.map.input.ignoreKey=false \
-inputformat org.apache.hadoop.mapred.TextInputFormat \
-file ./mapper.py \
-file ./reducer.py \
-mapper ./mapper.py \
-reducer ./reducer.py \
-input $INPUT_DIR \
-output $OUTPUT_DIR \
-cmdenv REGEX=$REGEX

My understanding is that TextInputFormat is the default, so I also tried the above command without the -inputformat option. I've also tried removing the -D, but I'm told that this is required to get the byte offset as key when using the streaming API.

For what it's worth, I'm just experimenting with Hadoop for a student project. At the moment, the mapper is a very simple python grep of a file in HDFS, matching each line against the supplied regex:

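import os
import re
import sys

# Compile the regex passed in via -cmdenv REGEX=...
pattern = re.compile(os.environ['REGEX'])
for line in sys.stdin:
    # Emit every input line that matches the pattern, unchanged.
    if pattern.search(line):
        sys.stdout.write(line)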

Right now though, the only thing that's output (to the reducer) is the matching lines. I'm expecting tab or whitespace delimited key/value pairs, where key=byte_offset and value=regex_line_match.
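To make the expected output concrete, here is a minimal sketch of a mapper body that would produce those pairs. It assumes that, when the key is passed through, each record arrives on stdin as the byte offset, a tab, and then the line; the function name grep_with_offsets is hypothetical. If no offset key arrives, it falls back to emitting the bare matching line, which is the behaviour described above.

```python
import re


def grep_with_offsets(records, pattern):
    """Yield 'offset<TAB>line' for each record whose line matches pattern.

    Assumption: a record that carries a key looks like '<digits>\t<line>'.
    A record without such a key is treated as a bare line.
    """
    regex = re.compile(pattern)
    for record in records:
        record = record.rstrip("\n")
        key, sep, value = record.partition("\t")
        if not (sep and key.isdigit()):
            # No offset key arrived; treat the whole record as the value.
            key, value = None, record
        if regex.search(value):
            yield value if key is None else key + "\t" + value
```

In the mapper this would be driven with sys.stdin as the records and os.environ['REGEX'] as the pattern, writing each yielded string followed by a newline. Note that a bare line which happens to begin with digits and a tab would be misread as keyed under this assumption.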

Can anyone tell me or suggest why this is happening?

Also, I'm just as interested in answering these two (related) questions:

  1. Is it possible for a mapper to manually determine the byte offset for each line of the data it is processing relative to the file which the data belongs to?
  2. Is it possible for a mapper to determine the total number of bytes in the file to which the data it is processing belongs?

If yes to either of these questions, how? (python, or streaming in general).
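To illustrate what question 1 means by a per-line byte offset, here is a small sketch that computes offsets for a byte stream read from its very beginning. It is only an illustration under that assumption: a mapper in a real job sees a single input split, so counting from zero like this would give offsets relative to the start of the split rather than the file. The name line_offsets is hypothetical.

```python
import io


def line_offsets(stream):
    """Yield (byte_offset, line_bytes) for each line of a binary stream.

    Assumption: the stream is read from the start of the file, so the
    running total of line lengths equals the offset within the file.
    """
    offset = 0
    for line in stream:
        yield offset, line
        # The next line starts right after this one, newline included.
        offset += len(line)


# Example with an in-memory stand-in for a file.
sample = io.BytesIO(b"foo bar\nbaz\nfoo\n")
pairs = list(line_offsets(sample))
```

After the last line, the running offset equals the total number of bytes read, which is the quantity question 2 asks about for the whole file.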

Edit:
If I use -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat then the byte offsets are produced as keys of the mapper output. But the job takes a really long time to complete (and my input file only has about 50 lines of text in it!).

Source: https://stackoverflow.com/questions/15600495/hadoop-streaming-mapper-byte-offset-not-being-generated
