I am trying to include a Python package (NLTK) with a Hadoop streaming job, but am not sure how to do this without including every file manually via the CLI argument "-file".
For an example of loading an external Python package (nltk), refer to the answer below.

Running an external Python library like NLTK with Hadoop Streaming
I followed the approach below and ran the nltk package with Hadoop Streaming successfully.

Assumption: you already have your package (nltk in my case) installed on your system.
first:
zip -r nltk.zip nltk
mv nltk.zip /place/it/anywhere/you/like/nltk.mod
Why will anywhere work?
Ans: Because we will provide the path to this zipped .mod file on the command line, so we don't need to worry much about where it lives.
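If you are unsure where the nltk package is installed, you can ask Python itself before zipping; the site-packages path below is just an illustrative example, yours may differ:

# print the directory that contains the installed nltk package,
# e.g. /usr/local/lib/python3.4/site-packages/nltk
python3 -c "import nltk, os; print(os.path.dirname(nltk.__file__))"
# run the zip command from its parent directory so the archive's top level is "nltk"
cd /usr/local/lib/python3.4/site-packages && zip -r nltk.zip nltk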
second:
changes in your mapper (.py) file:

import zipimport

# Hadoop ships nltk.mod as-is (-file does not unpack archives),
# so load the package straight out of the zip instead of unzipping it
importer = zipimport.zipimporter('nltk.mod')
nltk = importer.load_module('nltk')

# now import whatever you like from nltk
from nltk import tree
from nltk import load_parser
from nltk.corpus import stopwords
nltk.data.path += ["."]  # also search the task's working directory for nltk_data
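To see the pieces together, here is a minimal sketch of a complete mapper built on this trick. The word-count-without-stopwords logic is only an illustration (it is not part of the original job), and it assumes the stopwords corpus data has also been shipped to the task's working directory, which is what the nltk.data.path += ["."] line is for:

#!/usr/bin/env python3
# mapper.py -- hypothetical example: emit (word, 1) for every non-stopword
import sys
import zipimport

# nltk.mod was shipped via -file, so it sits in the task's working directory
importer = zipimport.zipimporter('nltk.mod')
nltk = importer.load_module('nltk')

from nltk.corpus import stopwords
nltk.data.path += ["."]  # look for shipped nltk_data next to the script

stop = set(stopwords.words('english'))

for line in sys.stdin:
    for word in line.strip().lower().split():
        if word not in stop:
            print("%s\t%d" % (word, 1))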
third: command-line arguments to run the map-reduce job
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-file /your/path/to/mapper/mapper.py \
-mapper '/usr/local/bin/python3.4 mapper.py' \
-file /your/path/to/reducer/reducer.py \
-reducer '/usr/local/bin/python3.4 reducer.py' \
-file /your/path/to/nltkzippedmodfile/nltk.mod \
-input /your/path/to/HDFS/input/check.txt -output /your/path/to/HDFS/output/
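Before submitting to the cluster, you can sanity-check the pipeline with a plain shell pipe, which mirrors how Hadoop Streaming invokes the scripts. This assumes a local copy of check.txt and that you run it from the directory containing nltk.mod, so zipimport can find the archive:

# simulate the streaming job locally
cat check.txt | /usr/local/bin/python3.4 mapper.py | sort | /usr/local/bin/python3.4 reducer.py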
Thus, the steps above solved my problem, and I think they should solve it for others as well.
cheers,