How can I include a Python package with a Hadoop streaming job?

情深已故  2020-11-27 13:47

I am trying to include a Python package (NLTK) with a Hadoop streaming job, but am not sure how to do this without including every file manually via the CLI argument "-file".

5 Answers
  •  我在风中等你
    2020-11-27 13:59

    This is an example of loading the external Python package nltk;
    see also the answer to
    "Running external python lib like (NLTK) with hadoop streaming".
    I followed the approach below and ran the nltk package with Hadoop streaming successfully.

    Assumption: you already have your package (nltk, in my case) installed on your system.

    First:

    zip -r nltk.zip nltk
    mv nltk.zip /place/it/anywhere/you/like/nltk.mod
    

    Why does any location work?
    Answer: because we pass the path to this zipped .mod file on the command line, we don't need to worry about where it lives.
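    As a quick sanity check before submitting the job, you can verify locally that the zipped package imports. This is a minimal sketch; the path below is just a placeholder for wherever you put nltk.mod:

    # local_check.py -- optional local sanity check (path is a placeholder)
    import zipimport

    importer = zipimport.zipimporter('/place/it/anywhere/you/like/nltk.mod')
    nltk = importer.load_module('nltk')
    print(nltk.__name__, 'loaded from', nltk.__file__)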

    Second: make these changes in your mapper (or other .py) file:

    # Hadoop streaming does not unpack the zip for you, so load the package
    # straight from the zipped .mod file with zipimport
    import zipimport
    importer = zipimport.zipimporter('nltk.mod')
    nltk = importer.load_module('nltk')

    # now import whatever you need from nltk
    from nltk import tree
    from nltk import load_parser
    from nltk.corpus import stopwords
    # look for nltk data files in the job's working directory
    nltk.data.path += ["."]
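    For context, a complete mapper.py built on this could look like the sketch below: a plain word-count mapper that filters English stopwords. The word-count logic and the availability of the stopwords corpus data in the working directory are assumptions for illustration, not part of the original answer.

    #!/usr/bin/env python3
    # mapper.py -- minimal word-count sketch (assumes plain text on stdin)
    import sys
    import zipimport

    # load nltk from the zipped .mod file shipped via -file
    importer = zipimport.zipimporter('nltk.mod')
    nltk = importer.load_module('nltk')
    nltk.data.path += ["."]   # look for nltk data next to the script

    from nltk.corpus import stopwords
    # NOTE: assumes the 'stopwords' corpus data has also been shipped with the job
    stop = set(stopwords.words('english'))

    for line in sys.stdin:
        for word in line.strip().lower().split():
            if word and word not in stop:
                print('%s\t1' % word)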
    

    Third: the command line to run the MapReduce job:

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -file /your/path/to/mapper/mapper.py \
    -mapper '/usr/local/bin/python3.4 mapper.py' \
    -file /your/path/to/reducer/reducer.py \
    -reducer '/usr/local/bin/python3.4 reducer.py' \
    -file /your/path/to/nltkzippedmodfile/nltk.mod \
    -input /your/path/to/HDFS/input/check.txt -output /your/path/to/HDFS/output/
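    The command also ships a reducer.py. For completeness, a minimal summing reducer matching the tab-separated "word\t1" output sketched above could look like this (again an illustrative sketch, not part of the original answer):

    #!/usr/bin/env python3
    # reducer.py -- minimal summing reducer (sketch)
    # Hadoop streaming sorts mapper output by key, so identical words arrive together.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, _, count = line.rstrip('\n').partition('\t')
        try:
            count = int(count)
        except ValueError:
            continue   # skip malformed lines
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                print('%s\t%d' % (current_word, current_count))
            current_word, current_count = word, count
    if current_word is not None:
        print('%s\t%d' % (current_word, current_count))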
    

    These steps solved my problem, and I think they should help others as well.
    Cheers.
