Reading a file in Hadoop Streaming

Submitted by 眉间皱痕 on 2019-12-12 03:28:35

Question


I am trying to read an auxiliary file in my mapper; here are my code and the command I use.

Mapper code:

#!/usr/bin/env python

from itertools import combinations
from operator import itemgetter
import sys

storage = {}

# Build a lookup table from the auxiliary file shipped with the job via -file.
with open('inputData', 'r') as inputFile:
    for line in inputFile:
        first, second = line.split()
        storage[(first, second)] = 0

# Process the map input records from standard input.
for line in sys.stdin:
    do_something()

And here is my command:

hadoop jar hadoop-streaming-2.7.1.jar \
-D stream.num.map.output.key.fields=2 \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D mapred.text.key.comparator.options='-k1,1 -k2,2' \
-D mapred.map.tasks=20 \
-D mapred.reduce.tasks=10 \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-mapper mapper.py -file mapper.py \
-reducer reducer.py -file reducer.py \
-file inputData \
-input /data \
-output /result

But I keep getting the error below, which indicates that my mapper fails to read from stdin. After deleting the part that reads the file, my code works, so I have pinpointed where the error occurs, but I don't know the correct way to read the file. Can anyone help?

 Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads():

Answer 1:


The error you are getting means your mapper went too long without writing anything to its stdout stream.

For example, a common cause of this error is a for loop inside your do_something() function that hits a continue statement under certain conditions. If that condition occurs too often in your input data, your script skips through continue many times in a row without writing anything to stdout. Hadoop waits too long without seeing any output, so the task is considered failed.
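
A minimal sketch of that failure pattern, assuming a hypothetical per-line do_something(); the field names and output format are illustrative, and 'storage' stands for the lookup dict built from 'inputData' in the question's mapper:

import sys

def do_something(line, storage):
    fields = line.split()
    # If most input lines fail this check, the mapper writes nothing to
    # stdout for a long stretch and Hadoop eventually considers the task dead.
    if len(fields) != 2 or tuple(fields) not in storage:
        return  # plays the role of the `continue` described above
    first, second = fields
    sys.stdout.write('%s\t%s\t1\n' % (first, second))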

Another possibility is that your input data file is too large and takes too long to read. But I think that counts as setup time, since it happens before the first line of output; I am not sure, though.

There are two relatively easy ways to solve this:

  1. (developer side) Modify your code to write something to stdout every now and then. In the case of continue, emit a short dummy token such as '\n' so Hadoop knows your script is still alive (see the sketch after this list).

  2. (system side) I believe you can set the following parameter with the -D option; it controls the timeout in milliseconds (see the command sketch further below):

    mapreduce.reduce.shuffle.read.timeout
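
A minimal sketch of option 1, reusing the hypothetical handler from the earlier sketch; the only change is the dummy newline written on the skip path, which downstream steps must be prepared to ignore:

import sys

def do_something(line, storage):
    fields = line.split()
    if len(fields) != 2 or tuple(fields) not in storage:
        # Heartbeat: emit a bare newline so Hadoop sees the mapper is alive.
        sys.stdout.write('\n')
        return
    first, second = fields
    sys.stdout.write('%s\t%s\t1\n' % (first, second))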

I have never tried option 2. Usually I avoid streaming over data that requires filtering. Streaming, especially when done with a scripting language like Python, should do as little work as possible. My use cases are mostly post-processing output from Apache Pig, where the filtering has already been done in the Pig scripts and I need something that is not available in Jython.
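
If you do want to try option 2, the question's command with the timeout raised might look like the sketch below; the parameter name is simply the one suggested above (untested here), and the value of 600000 milliseconds (ten minutes) is only an illustrative assumption:

hadoop jar hadoop-streaming-2.7.1.jar \
-D mapreduce.reduce.shuffle.read.timeout=600000 \
-D stream.num.map.output.key.fields=2 \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D mapred.text.key.comparator.options='-k1,1 -k2,2' \
-D mapred.map.tasks=20 \
-D mapred.reduce.tasks=10 \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-mapper mapper.py -file mapper.py \
-reducer reducer.py -file reducer.py \
-file inputData \
-input /data \
-output /result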



Source: https://stackoverflow.com/questions/35978467/reading-file-in-hadoop-streaming
