Hadoop: Output file has double output


Question


I am running a Hadoop program and have the following as my input file, input.txt:

1
2

mapper.py:

#!/usr/bin/env python
import sys

# Echo every input line unchanged, then append one extra "Test" line at the end.
for line in sys.stdin:
    print line,
print "Test"

reducer.py:

#!/usr/bin/env python
import sys

# Identity reducer: pass every line through unchanged.
for line in sys.stdin:
    print line,
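
Both scripts are run directly as ./mapper.py and ./reducer.py, so they need the interpreter line shown above and the executable bit set; assuming a Unix shell:

$ chmod +x mapper.py reducer.py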

When I run it without Hadoop: $ cat ./input.txt | ./mapper.py | ./reducer.py, the output is as expected:

1
2
Test

However, running it through Hadoop via the streaming API (as described here), the latter part of the output seems somewhat "doubled":

1
2
Test    
Test
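
For reference, a streaming invocation along these lines is assumed here (the jar location and HDFS paths below are placeholders, not taken from the question):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input /user/me/input \
    -output /user/me/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py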

Additionally, when I run the program through Hadoop, it seems to have roughly a 1-in-4 chance of failing with this error:

Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.

I've looked at this for some time and can't figure out what I'm not getting. If anyone could help with these issues, I would greatly appreciate it! Thanks.

Edit: when input.txt is:

1
2
3
4
5
6
7
8
9
10

The output is:

1   
10  
2   
3   
4   
5   
6   
7   
8   
9   
Test    
Test

Answer 1:


Using mapper.py as the reducer gives exactly this output. My guess is that you are pointing the reducer at mapper.py as well, rather than at reducer.py. Make sure you are providing the correct path to reducer.py.
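
For example, if the job was submitted with the reducer option pointing at mapper.py (first line below), the reduce stage re-runs mapper.py: it echoes the map output, including its "Test" line, and then appends another "Test" of its own, which matches the doubled output above. The flags are standard Hadoop streaming options; the paths are illustrative:

# suspected misconfiguration: mapper.py passed for both roles
-mapper mapper.py -reducer mapper.py -file mapper.py

# intended setup
-mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py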



Source: https://stackoverflow.com/questions/19011036/hadoop-output-file-has-double-output
