Question
I'm trying to run a Python program on Hadoop. The program uses the NLTK library and the Hadoop Streaming API, as described here.
mapper.py:
#!/usr/bin/env python
import sys
import nltk
from nltk.corpus import stopwords
#print stopwords.words('english')
for line in sys.stdin:
    print line,
reducer.py:
#!/usr/bin/env python
import sys
for line in sys.stdin:
    print line,
Console command:
bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -file /hadoop/mapper.py -mapper /hadoop/mapper.py \
    -file /hadoop/reducer.py -reducer /hadoop/reducer.py \
    -input /hadoop/input.txt -output /hadoop/output
This runs perfectly, with the output simply containing the lines of the input file.
However, when this line (from mapper.py):
#print stopwords.words('english')
is uncommented, the job fails with:
Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.
I have checked that, in a standalone Python program,
print stopwords.words('english')
works perfectly fine, so I am absolutely stumped as to why it causes my Hadoop program to fail.
I would greatly appreciate any help! Thank you
Answer 1:
Is 'english' a file in print stopwords.words('english')? If so, you need to pass it with -file as well so that it gets shipped to the nodes.
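For example, the mapper could point NLTK at the shipped copy before loading the corpus. This is a minimal sketch, assuming the stopwords data was packaged so that an nltk_data/corpora/stopwords directory lands in each task's working directory (e.g. via the streaming -archives flag); the nltk_data name and the shipping mechanism are assumptions, not something stated in the thread:

#!/usr/bin/env python
# Hypothetical mapper sketch: assumes a directory named nltk_data, with
# the usual corpora/stopwords layout inside it, was shipped to each
# task's working directory (e.g. via the streaming -archives flag).
import os
import sys
import nltk

# Point NLTK at the shipped copy before touching the corpus; without
# this, NLTK only searches paths that exist on your own machine.
nltk.data.path.append(os.path.join(os.getcwd(), 'nltk_data'))
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))  # now resolves on every node

for line in sys.stdin:
    print line,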
Answer 2:
Use zipimport to load the libraries directly from the zipped packages (no unzipping needed):
import zipimport

importer = zipimport.zipimporter('nltk.zip')
importer2 = zipimport.zipimporter('yaml.zip')
yaml = importer2.load_module('yaml')
nltk = importer.load_module('nltk')
Check the links I pasted above; they describe all the steps.
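The answer takes nltk.zip and yaml.zip as given. Here is one hypothetical way to build them on your own machine, sketched under the assumptions that both packages are installed locally as ordinary directories and that the resulting zips are then shipped to the nodes with -file:

# One-off packaging script (run locally, not inside the job).
import os
import shutil

import nltk
import yaml

# zipimport expects the package directory (e.g. nltk/) at the top level
# of the archive, so build each zip from the parent site-packages dir.
for mod, name in [(nltk, 'nltk'), (yaml, 'yaml')]:
    pkg_dir = os.path.dirname(mod.__file__)   # .../site-packages/nltk
    parent = os.path.dirname(pkg_dir)         # .../site-packages
    shutil.make_archive(name, 'zip', root_dir=parent, base_dir=name)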
Source: https://stackoverflow.com/questions/19057741/hadoop-and-nltk-fails-with-stopwords