How can I include a Python package with a Hadoop streaming job?


I am trying to include a Python package (NLTK) with a Hadoop streaming job, but am not sure how to do this without including every file manually via the CLI argument "-file".

5 Answers
  •  谎友^ (OP)
    2020-11-27 14:14

    Just came across this gem of a solution: http://blog.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/

    First, create a zip containing the desired libraries (a quick way to locate the installed packages follows the commands below):

    zip -r nltkandyaml.zip nltk yaml
    mv nltkandyaml.zip /path/to/where/your/mapper/will/be/nltkandyaml.mod
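
    If you are not sure where the installed packages live, you can print their locations first (a small sketch; the paths vary by Python installation):

    # Print the directories that contain the installed packages,
    # so you know where to run the zip command from
    python -c "import nltk, os; print(os.path.dirname(nltk.__file__))"
    python -c "import yaml, os; print(os.path.dirname(yaml.__file__))"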
    

    Next, ship the archive to the task nodes via the Hadoop streaming "-file" argument:

    hadoop jar hadoop-streaming.jar ... -file nltkandyaml.mod
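
    In context, a full streaming invocation would look something like this (a sketch: the streaming jar location, HDFS paths, and script names are placeholders for your own):

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -input /user/me/input \
        -output /user/me/output \
        -mapper mapper.py \
        -reducer reducer.py \
        -file mapper.py \
        -file reducer.py \
        -file nltkandyaml.mod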
    

    Finally, load the libraries from Python:

    import zipimport

    # The extension does not matter to zipimport; the file just has to be
    # a valid zip archive shipped into the task's working directory
    importer = zipimport.zipimporter('nltkandyaml.mod')
    yaml = importer.load_module('yaml')
    nltk = importer.load_module('nltk')
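
    Put together, a minimal streaming mapper using this pattern might look like the following (a sketch; the whitespace tokenization is a stand-in for your own logic):

    #!/usr/bin/env python
    # Sketch of a streaming mapper that imports NLTK from the shipped archive
    import sys
    import zipimport

    importer = zipimport.zipimporter('nltkandyaml.mod')
    nltk = importer.load_module('nltk')

    for line in sys.stdin:
        # Emit a (token, 1) pair per whitespace token; swap in an NLTK
        # tokenizer once its data files are shipped with the job too
        for token in line.split():
            sys.stdout.write('%s\t1\n' % token)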
    

    Additionally, this page summarizes how to include a corpus: http://www.xcombinator.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/

    Download and unzip the WordNet corpus, then flatten it into a single zip:

    cd wordnet
    zip -r ../wordnet-flat.zip *
    

    Then, in Python:

    from nltk.corpus.reader import WordNetCorpusReader
    wn = WordNetCorpusReader(nltk.data.find('lib/wordnet-flat.zip'))
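
    If you ship the flattened corpus with a plain "-file" argument rather than Cascading, it lands in the task's working directory instead of under lib/, so you may need to add that directory to NLTK's search path first (a sketch, assuming the archive is shipped as wordnet-flat.zip):

    import os
    import nltk.data
    from nltk.corpus.reader import WordNetCorpusReader

    # "-file" payloads are placed in the task's working directory
    nltk.data.path.append(os.getcwd())
    # Single-argument constructor as in older NLTK; newer releases
    # also expect an Open Multilingual WordNet reader argument
    wn = WordNetCorpusReader(nltk.data.find('wordnet-flat.zip'))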
    
