I am trying to include a Python package (NLTK) with a Hadoop streaming job, but am not sure how to do this without including every file manually via the CLI argument "-file".
If you are using more complex libraries such as numpy and pandas, a virtualenv is a better way. You can use -archives to ship the environment to the cluster.
See this write-up: https://henning.kropponline.de/2014/07/18/virtualenv-hadoop-streaming/
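Roughly, the idea is something like the following sketch (venv and venv.tar are placeholder names, and note the update below about the problems I eventually hit with this approach):

virtualenv venv
venv/bin/pip install numpy pandas
tar cf venv.tar venv
# then pass "-archives venv.tar" to the streaming job and call
# venv.tar/venv/bin/python in the -mapper / -reducer commands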
Update:
I tried the virtualenv approach above in our production environment and ran into problems. On the cluster there were errors like "Could not find platform independent libraries". I then used conda to create the Python environment instead, and it worked well.
If you read Chinese, you can look at this: https://blog.csdn.net/Jsin31/article/details/53495423
If not, here is a brief translation:
Create an environment with conda:
conda create -n test python=2.7.12 numpy pandas
Go to the conda environment path. You can find it with:
conda env list
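For example, the output looks something like this (the paths will differ on your machine):

# conda environments:
#
base                     /opt/conda
test                     /opt/conda/envs/test

Change into the envs directory (here /opt/conda/envs) before packing, so the tarball contains a top-level test/ directory.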
Then you can pack it:
tar cf test.tar test
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
-archives test.tar \
-input /user/testfiles \
-output /user/result \
-mapper "test.tar/test/bin/python mapper.py" \
-file mapper.py \
-reducer"test.tar/test/bin/python reducer.py" \
-file reducer.py
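For completeness, here is a minimal word-count style mapper.py and reducer.py sketch that would run under the packaged interpreter. The numpy import is only there to show that packages from the shipped env are importable; replace it with nltk, pandas, or whatever your job actually needs.

mapper.py:

#!/usr/bin/env python
# Minimal mapper sketch: emits (word, 1) pairs, one per line.
import sys
import numpy  # from the packaged conda env; swap in nltk/pandas as needed

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

reducer.py:

#!/usr/bin/env python
# Minimal reducer sketch: sums counts per word (input arrives sorted by key).
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))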