PySpark: run a script from inside the archive

Question


I have an archive (basically a bundled conda environment + my application) which I can easily use with pyspark in yarn master mode:

PYSPARK_PYTHON=./pkg/venv/bin/python3 \
spark-submit \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./pkg/venv/bin/python3 \
  --master yarn \
  --deploy-mode cluster \
  --archives hdfs:///package.tgz#pkg \
  app/MyScript.py

This works as expected, no surprise here.

Now, how could I run this if MyScript.py is inside package.tgz, not on my local filesystem?

I would like to replace the last line of my command with e.g. ./pkg/app/MyScript.py, but then Spark complains: java.io.FileNotFoundException: File file:/home/blah/pkg/app/MyScript.py does not exist.

I could of course extract the script first and put it separately on HDFS. There are workarounds, but since I have everything in one nice place, I would love to use it directly.

If it's relevant, this is Spark 2.4.0 and Python 3.7, on CDH.


Answer 1:


As I understand it, you cannot: you must supply a Python script to spark-submit.

But you can have a very short script and use --py-files to distribute a ZIP or EGG of the rest of your code:

# go.py
from my.app import run

run()

# my/app.py
def run():
    print("hello")

You can create a ZIP file containing the my directory (include an empty my/__init__.py so that my is a regular, importable package), e.g. with zip -r my.zip my, and submit it together with the short entry point script: spark-submit --py-files my.zip go.py
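If you prefer to build the archive from Python rather than the zip CLI, here is a minimal sketch using only the standard library (the file name make_zip.py is illustrative):

# make_zip.py -- builds my.zip with the my/ directory at the archive root
import shutil

# base_name="my" + format="zip" produces my.zip in the current directory;
# base_dir="my" keeps the package directory as the top-level entry.
shutil.make_archive("my", "zip", root_dir=".", base_dir="my")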

If you like, you can make a generic go.py that accepts arguments telling it which module and method to import and run.
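A minimal sketch of such a generic entry point (an assumption, not part of the original answer; the command-line interface is illustrative):

# go.py -- generic entry point: the module and function to run are
# given on the command line, so one script covers any bundled app.
import importlib
import sys

if __name__ == "__main__":
    module_name, func_name = sys.argv[1], sys.argv[2]  # e.g. "my.app" "run"
    module = importlib.import_module(module_name)      # resolved from the --py-files ZIP
    getattr(module, func_name)()                       # look up and call the entry function

You would then submit it as, for example: spark-submit --py-files my.zip go.py my.app run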



Source: https://stackoverflow.com/questions/62431781/pyspark-run-a-script-from-inside-the-archive
