Question
I tried the approach described at http://blog.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/:
zip -r nltkandyaml.zip nltk yaml
mv nltkandyaml.zip /path/to/where/your/mapper/will/be/nltkandyaml.mod
import zipimport
importer = zipimport.zipimporter('nltkandyaml.mod')
yaml = importer.load_module('yaml')
nltk = importer.load_module('nltk')
And the error I got is:
job_201406080403_3863/attempt_201406080403_3863_m_000000_0/work/./app/mapper.py", line 12, in import nltk ImportError: No module named nltk
If anybody has faced a similar problem, could you please post a complete solution?
Thanks
Answer 1:
I followed the approach below and ran the nltk package with Hadoop Streaming successfully.
Note: I only used the nltk package, not yaml, so my answer focuses on loading nltk; I believe the same approach should work for yaml as well.
Assumption: the nltk package is already installed on your system.
First (run this from the directory that contains the nltk package folder):
zip -r nltk.zip nltk
mv nltk.zip /place/it/anywhere/you/like/nltk.mod
Why will any location work?
Because the path to this zipped .mod file is passed on the command line (the -file option in the third step), so its location does not matter.
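If the zip command is not available, the same archive can probably be built from Python. This is only a minimal sketch: build_module_archive is a hypothetical helper name, and it assumes you pass it the path of the installed nltk package directory. It stores every file under the package's own folder name (for example nltk/__init__.py), which is the layout zipimport expects.

```python
import os
import zipfile


def build_module_archive(package_dir, archive_path):
    """Zip a package directory so its folder name is the archive's top level."""
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Store paths relative to the parent, e.g. "nltk/__init__.py".
                archive.write(full_path, os.path.relpath(full_path, parent))
    return archive_path
```

A call such as build_module_archive("/path/to/site-packages/nltk", "nltk.mod") would then stand in for the zip and mv commands; the site-packages path here is a placeholder, not a real location.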
Second:
Changes in your mapper (or other .py) file:
# Hadoop does not unpack the shipped archive, so import nltk straight from the zip file
import zipimport
importer = zipimport.zipimporter('nltk.mod')
nltk = importer.load_module('nltk')
# now import whatever you like from nltk
from nltk import tree
from nltk import load_parser
from nltk.corpus import stopwords
nltk.data.path += ["."]
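On newer Python versions, where zipimporter.load_module() is deprecated, a possible alternative is to put the archive on sys.path and use a normal import. The sketch below is not the method used above, only a hedged variant: import_from_archive is a hypothetical helper name, and it assumes the archive sits in the task's working directory, as it does when shipped with -file.

```python
import importlib
import sys


def import_from_archive(archive_path, module_name):
    """Import module_name from a zip archive by adding the archive to sys.path."""
    if archive_path not in sys.path:
        # Python's built-in zip import hook accepts any sys.path entry that is
        # a valid zip archive, whatever its file extension (here ".mod").
        sys.path.insert(0, archive_path)
    return importlib.import_module(module_name)
```

In the mapper this would be used as nltk = import_from_archive("nltk.mod", "nltk"), after which the from nltk import ... lines work as before.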
Third, and the most important step, which I guess you might be missing:
the command-line arguments to run the MapReduce job:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-file /your/path/to/mapper/mapper.py \
-mapper '/usr/local/bin/python3.4 mapper.py' \
-file /your/path/to/reducer/reducer.py \
-reducer '/usr/local/bin/python3.4 reducer.py' \
-file /your/path/to/nltkzippedmodfile/nltk.mod \
-input /your/path/to/HDFS/input/check.txt -output /your/path/to/HDFS/output/
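For reference, Hadoop Streaming feeds each input line to the mapper command on stdin and reads tab-separated key/value lines from its stdout. A minimal sketch of a mapper body written that way is shown below. The names map_lines and run_mapper are hypothetical, and the stopwords argument is assumed to be any set of words you supply (for example one built from nltk.corpus.stopwords once nltk is importable); it is an illustration, not the mapper from the question.

```python
import sys


def map_lines(lines, stopwords=frozenset()):
    """Yield (word, 1) for every token in the input that is not a stopword."""
    for line in lines:
        for word in line.strip().lower().split():
            if word not in stopwords:
                yield word, 1


def run_mapper(stdin=sys.stdin, stdout=sys.stdout, stopwords=frozenset()):
    """Write key<TAB>value lines, the default Hadoop Streaming record format."""
    for word, count in map_lines(stdin, stopwords):
        stdout.write("%s\t%d\n" % (word, count))
```

Keeping the logic in functions that take the streams as arguments makes it possible to try the mapper locally on a small file before submitting the job.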
These steps solved my problem, and I think they should work for others as well.
cheers,
Source: https://stackoverflow.com/questions/24167933/running-external-python-lib-like-nltk-with-hadoop-streaming