Running an external Python lib (like NLTK) with Hadoop Streaming

我的未来我决定 · Submitted 2019-12-12 02:52:19

Question


I tried the approach from http://blog.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/:

zip -r nltkandyaml.zip nltk yaml
mv nltkandyaml.zip /path/to/where/your/mapper/will/be/nltkandyaml.mod

import zipimport
importer = zipimport.zipimporter('nltkandyaml.mod')
yaml = importer.load_module('yaml')
nltk = importer.load_module('nltk')

And the error I get is:

job_201406080403_3863/attempt_201406080403_3863_m_000000_0/work/./app/mapper.py", line 12, in <module>
    import nltk
ImportError: No module named nltk

If anybody has faced a similar problem, could you please post a complete solution?

Thanks


Answer 1:


I followed the approach below and successfully ran the nltk package with Hadoop Streaming.

Note: I only used the nltk package, not yaml, so my answer focuses on loading nltk. I believe the same approach should work for your yaml package as well.

Assumption: you already have the nltk package installed on your system.

first:

zip -r nltk.zip nltk
mv nltk.zip /place/it/anywhere/you/like/nltk.mod

Why will any location work?
Ans: Because we provide the path to this zipped .mod file on the command line, so we don't need to worry about where it lives.
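If you prefer to do the packaging step from Python instead of the `zip` command, a rough equivalent is sketched below (the helper name `zip_package` is just an illustration, not from the answer):

```python
import os
import zipfile

def zip_package(pkg_dir, out_path):
    """Recursively zip a package directory (same effect as `zip -r`)."""
    with zipfile.ZipFile(out_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(pkg_dir):
            for name in files:
                full = os.path.join(root, name)
                # Store paths relative to the package's parent directory so
                # the archive contains a top-level 'nltk/' folder, which is
                # what zipimport expects when importing the package.
                arcname = os.path.relpath(full, os.path.dirname(pkg_dir))
                zf.write(full, arcname)
```

For example, `zip_package('/usr/lib/python3/dist-packages/nltk', 'nltk.mod')` would produce the same kind of archive as the shell commands above.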

second:
make these changes in your mapper (.py) file

#Hadoop does not unpack the zip for you, so load the package via zipimport
import zipimport
importer = zipimport.zipimporter('nltk.mod')
nltk = importer.load_module('nltk')

#now import whatever you like from nltk
from nltk import tree
from nltk import load_parser
from nltk.corpus import stopwords
nltk.data.path += ["."]
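To make the snippet above concrete, here is a minimal, self-contained mapper sketch built around it. The word-count logic, the stopword filtering, and the local fallback (an empty stopword set when `nltk.mod` is absent) are my own illustrative assumptions, not part of the original answer:

```python
#!/usr/bin/env python3
import sys
import zipimport

# Load nltk from the zipped nltk.mod shipped alongside the mapper via -file.
# The fallback branch only exists so the script also runs outside the
# cluster for local testing; it is not part of the original recipe.
try:
    importer = zipimport.zipimporter('nltk.mod')
    nltk = importer.load_module('nltk')
    nltk.data.path += ["."]
    from nltk.corpus import stopwords
    STOPWORDS = set(stopwords.words('english'))
except Exception:
    STOPWORDS = set()  # no filtering when nltk.mod is unavailable

def map_line(line):
    """Emit (word, 1) pairs for each non-stopword token on one line."""
    return [(w, 1) for w in line.strip().lower().split() if w not in STOPWORDS]

if __name__ == '__main__':
    # Hadoop Streaming feeds input lines on stdin; emit key<TAB>value pairs.
    for line in sys.stdin:
        for word, count in map_line(line):
            print(f"{word}\t{count}")
```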

third: the most important step, and the one I guess you might be missing:

the command-line arguments to run the map-reduce job

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-file /your/path/to/mapper/mapper.py \
-mapper '/usr/local/bin/python3.4 mapper.py' \
-file /your/path/to/reducer/reducer.py \
-reducer '/usr/local/bin/python3.4 reducer.py' \
-file /your/path/to/nltkzippedmodfile/nltk.mod \
-input /your/path/to/HDFS/input/check.txt -output /your/path/to/HDFS/output/
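The command above also ships a reducer.py, which the answer does not show. Assuming the mapper emits tab-separated `word<TAB>count` lines (my assumption, matching typical streaming word counts), a minimal reducer sketch could look like this — Hadoop sorts mapper output by key before the reducer sees it, so equal keys arrive adjacent:

```python
#!/usr/bin/env python3
import sys
from itertools import groupby

def reduce_pairs(lines):
    """Sum the counts for each key in sorted key<TAB>count lines."""
    def key_of(line):
        return line.split('\t', 1)[0]
    # groupby relies on the streaming framework's sort-by-key guarantee.
    return [
        (key, sum(int(l.rstrip('\n').split('\t', 1)[1]) for l in group))
        for key, group in groupby(lines, key=key_of)
    ]

if __name__ == '__main__':
    for key, total in reduce_pairs(sys.stdin):
        print(f"{key}\t{total}")
```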

The steps above solved my problem, and I think they should solve it for others as well.
Cheers,



Source: https://stackoverflow.com/questions/24167933/running-external-python-lib-like-nltk-with-hadoop-streaming
