spark-submit with specific Python libraries

Submitted by 一曲冷凌霜 on 2020-01-15 07:21:55

Question


I have PySpark code that depends on third-party libraries. I want to execute this code on my cluster, which runs under Mesos.

I have a zipped version of my Python environment on an HTTP server reachable from my cluster.

I am having trouble telling my spark-submit command to use this environment. I use --archives to load the zip file, together with --conf 'spark.pyspark.driver.python=path/to/my/env/bin/python' and --conf 'spark.pyspark.python=path/to/my/env/bin/python' to point at it.
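
Put together, the command looks roughly like this (the master URL, archive URL, and script name below are placeholders, not my real values):

spark-submit --master mesos://<master-url> \
    --archives http://<my-server>/env.zip \
    --conf 'spark.pyspark.driver.python=path/to/my/env/bin/python' \
    --conf 'spark.pyspark.python=path/to/my/env/bin/python' \
    my_script.py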

This does not seem to work... Am I doing something wrong? Do you have any idea how to do this?

Cheers, Alex


Answer 1:


To submit your zip archive to PySpark, send the file using:

spark-submit --py-files your_zip.zip your_code.py

To use it inside your code, you will need statements like the following:

sc.addPyFile("your_zip.zip")  # ships the zip to every node and adds it to the import path
import your_module            # a module packaged at the top level of your_zip.zip
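
For example, a minimal sketch of a driver script, assuming the archive is named deps.zip and contains a top-level module mymodule with a function transform (all of these names are hypothetical):

# deps.zip and mymodule are hypothetical names used for illustration
from pyspark import SparkContext

sc = SparkContext(appName="py-files-demo")
sc.addPyFile("deps.zip")  # make the zip's contents importable on driver and executors
import mymodule           # imported from inside deps.zip

# executors can now resolve mymodule.transform as well
print(sc.parallelize([1, 2, 3]).map(mymodule.transform).collect())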

Hope this helps!




Answer 2:


This may be helpful if you have dependencies.

I found a solution for properly shipping a virtual environment to the master and all the workers:

# make the virtual environment use relative paths so it survives being moved
virtualenv venv --relocatable
cd venv
# zip the environment's contents (bin/, lib/, ...) into venv.zip
zip -qr ../venv.zip *

PYSPARK_PYTHON=./SP/bin/python spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./SP/bin/python \
    --driver-memory 4G \
    --archives venv.zip#SP \
    filename.py
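
The #SP suffix on --archives tells YARN to unpack venv.zip into a directory named SP inside each container's working directory, which is why both interpreter settings point at ./SP/bin/python. As a sanity check, filename.py could print which interpreter the driver and executors actually run (a hypothetical script, not from the original answer):

# hypothetical contents of filename.py: verify the shipped interpreter is used
import sys
from pyspark import SparkContext

sc = SparkContext(appName="venv-check")
print("driver python:", sys.executable)
# each executor reports its own interpreter path
print(sc.parallelize(range(2), 2).map(lambda _: __import__("sys").executable).collect())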


Source: https://stackoverflow.com/questions/48644166/spark-submit-with-specific-python-librairies
