How can I include a Python package with a Hadoop streaming job?


I am trying to include a Python package (NLTK) with a Hadoop streaming job, but am not sure how to do this without including every file manually via the CLI argument "-file".

5 Answers
  •  谎友^ (OP)
    2020-11-27 14:14

    Just came across this gem of a solution: http://blog.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/

    First, create a zip containing the desired libraries (a quick way to locate the installed packages follows the commands below):

    zip -r nltkandyaml.zip nltk yaml
    mv nltkandyaml.zip /path/to/where/your/mapper/will/be/nltkandyaml.mod
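
    If you are not sure where the installed packages live, you can print their locations first (a small sketch; the paths vary by Python installation):

    # Print the directories that contain the installed packages,
    # so you know where to run the zip command from
    python -c "import nltk, os; print(os.path.dirname(nltk.__file__))"
    python -c "import yaml, os; print(os.path.dirname(yaml.__file__))"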
    

    Next, ship the archive to the task nodes via the Hadoop streaming "-file" argument:

    hadoop jar hadoop-streaming.jar ... -file nltkandyaml.mod
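
    In context, a full streaming invocation would look something like this (a sketch: the streaming jar location, HDFS paths, and script names are placeholders for your own):

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -input /user/me/input \
        -output /user/me/output \
        -mapper mapper.py \
        -reducer reducer.py \
        -file mapper.py \
        -file reducer.py \
        -file nltkandyaml.mod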
    

    Finally, load the libraries from Python:

    import zipimport

    # The extension does not matter to zipimport; the file just has to be
    # a valid zip archive shipped into the task's working directory
    importer = zipimport.zipimporter('nltkandyaml.mod')
    yaml = importer.load_module('yaml')
    nltk = importer.load_module('nltk')
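
    Put together, a minimal streaming mapper using this pattern might look like the following (a sketch; the whitespace tokenization is a stand-in for your own logic):

    #!/usr/bin/env python
    # Sketch of a streaming mapper that imports NLTK from the shipped archive
    import sys
    import zipimport

    importer = zipimport.zipimporter('nltkandyaml.mod')
    nltk = importer.load_module('nltk')

    for line in sys.stdin:
        # Emit a (token, 1) pair per whitespace token; swap in an NLTK
        # tokenizer once its data files are shipped with the job too
        for token in line.split():
            sys.stdout.write('%s\t1\n' % token)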
    

    Additionally, this page summarizes how to include a corpus: http://www.xcombinator.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/

    Download and unzip the WordNet corpus, then flatten it into a single zip:

    cd wordnet
    zip -r ../wordnet-flat.zip *
    

    Then, in Python:

    from nltk.corpus.reader import WordNetCorpusReader
    wn = WordNetCorpusReader(nltk.data.find('lib/wordnet-flat.zip'))
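
    If you ship the flattened corpus with a plain "-file" argument rather than Cascading, it lands in the task's working directory instead of under lib/, so you may need to add that directory to NLTK's search path first (a sketch, assuming the archive is shipped as wordnet-flat.zip):

    import os
    import nltk.data
    from nltk.corpus.reader import WordNetCorpusReader

    # "-file" payloads are placed in the task's working directory
    nltk.data.path.append(os.getcwd())
    # Single-argument constructor as in older NLTK; newer releases
    # also expect an Open Multilingual WordNet reader argument
    wn = WordNetCorpusReader(nltk.data.find('wordnet-flat.zip'))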
    
