Hadoop and NLTK: Fails with stopwords

Posted on 2019-12-13 02:26:16

Question


I'm trying to run a Python program on Hadoop. The program involves the NLTK library. The program also utilizes the Hadoop Streaming API, as described here.

mapper.py:

#!/usr/bin/env python
import sys
import nltk
from nltk.corpus import stopwords

#print stopwords.words('english')

for line in sys.stdin:
    print line,

reducer.py:

#!/usr/bin/env python

import sys
for line in sys.stdin:
    print line,

Console command:

bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -file /hadoop/mapper.py -mapper /hadoop/mapper.py \
    -file /hadoop/reducer.py -reducer /hadoop/reducer.py \
    -input /hadoop/input.txt -output /hadoop/output

This runs perfectly, with the output simply containing the lines of the input file.

However, when this line (from mapper.py):

#print stopwords.words('english')

is uncommented, then the program fails and says

Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.

I have checked and in a standalone python program,

print stopwords.words('english')

works perfectly fine, and so I am absolutely stumped as to why it's causing my Hadoop program to fail.

I would greatly appreciate any help! Thank you


Answer 1:


Is 'english' a file that print stopwords.words('english') reads from disk? If so, you need to pass it with -file as well, so that it gets shipped to the worker nodes.
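In the spirit of this answer: the stopwords corpus lives in NLTK's nltk_data directory on the submitting machine, so it is typically missing on the task nodes. One workaround is to ship a plain-text stopword list with -file and read it locally in the mapper. A minimal sketch (the filename english_stopwords.txt is hypothetical, and this uses Python 3 print syntax):

```python
#!/usr/bin/env python
import os
import sys

def load_stopwords(path='english_stopwords.txt'):
    """Read one stopword per line from a file shipped with -file.
    Falls back to an empty set if the file was not shipped."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(word.strip() for word in f if word.strip())

def filter_line(line, stops):
    """Drop stopwords from a whitespace-tokenized line."""
    return ' '.join(w for w in line.split() if w.lower() not in stops)

if __name__ == '__main__':
    stops = load_stopwords()
    for line in sys.stdin:
        print(filter_line(line, stops))
```

Because Hadoop Streaming copies every -file payload into the task's working directory, a relative path like this resolves on each node without any NLTK download step.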




Answer 2:


Use these commands to load the zipped libraries:

import zipimport

importer = zipimport.zipimporter('nltk.zip')
importer2 = zipimport.zipimporter('yaml.zip')
yaml = importer2.load_module('yaml')
nltk = importer.load_module('nltk')

Check the links I pasted above; they describe all the steps.
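This approach assumes nltk.zip and yaml.zip were shipped to each task with -file. An equivalent, version-robust sketch puts the zip on sys.path and lets Python's built-in zipimport machinery do the loading, instead of calling zipimporter.load_module explicitly (which is deprecated in recent Python versions); the archive names are the ones the answer assumes:

```python
#!/usr/bin/env python
import os
import sys

def load_from_zip(zip_name, module_name):
    """Import module_name from a zip archive shipped to the task's
    working directory (e.g. via Hadoop Streaming's -file option)."""
    if zip_name not in sys.path:
        # The import machinery uses zipimport for zip entries on sys.path.
        sys.path.insert(0, zip_name)
    return __import__(module_name)

if __name__ == '__main__':
    # Only attempt the imports when the archives were actually shipped.
    if os.path.exists('yaml.zip'):
        yaml = load_from_zip('yaml.zip', 'yaml')
    if os.path.exists('nltk.zip'):
        nltk = load_from_zip('nltk.zip', 'nltk')
```

Note that zipping the pure-Python packages only solves the library import; the stopwords corpus data still has to be made available on the nodes separately.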



Source: https://stackoverflow.com/questions/19057741/hadoop-and-nltk-fails-with-stopwords
