How does Spark interoperate with CPython


Question


I have an Akka system written in Scala that needs to call out to some Python code that relies on Pandas and NumPy, so I can't just use Jython. I noticed that Spark uses CPython on its worker nodes, so I'm curious how it executes Python code and whether that code exists in some reusable form.


Answer 1:


The PySpark architecture is described here: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals.

As @Holden said, Spark uses py4j to access Java objects in the JVM from Python. But that is only one case: when the driver program is written in Python (the left part of the diagram there).
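A minimal sketch of that first case, using py4j directly rather than Spark's own driver code (the EntryPoint class and its add method are made up for illustration): the JVM side starts a py4j GatewayServer, and a Python process can then connect to it over a socket and call methods on the exposed object.

```scala
import py4j.GatewayServer

// Hypothetical entry point that a Python client can call into.
// This is not Spark's driver code, just the bare py4j pattern.
class EntryPoint {
  def add(a: Int, b: Int): Int = a + b
}

object GatewayExample {
  def main(args: Array[String]): Unit = {
    // Start a py4j gateway on the default port (25333); a Python process
    // can then connect with py4j's JavaGateway and call entry_point.add(1, 2).
    val server = new GatewayServer(new EntryPoint)
    server.start()
    println(s"Gateway listening on port ${server.getListeningPort}")
  }
}
```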

The other case (the right part of the diagram) is when a Spark worker starts a Python process, sends it serialized Java objects to be processed, and receives the output back. The Java objects are serialized into pickle format so that Python can read them.
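For a rough picture of that second case, here is a minimal Scala sketch of the pattern, not Spark's actual implementation: it just spawns a CPython process and pipes newline-delimited text over stdin/stdout, whereas PythonRDD.scala uses sockets and a framed binary protocol carrying pickled records.

```scala
import java.io.{BufferedReader, InputStreamReader, PrintWriter}

// Sketch of the worker-side pattern: launch a CPython process and
// exchange data with it over its standard streams.
object PythonWorkerSketch {
  def main(args: Array[String]): Unit = {
    val proc = new ProcessBuilder(
      "python3", "-c",
      "import sys\nfor line in sys.stdin: print(int(line) * 2, flush=True)"
    ).start()

    val toPython   = new PrintWriter(proc.getOutputStream, true)
    val fromPython = new BufferedReader(new InputStreamReader(proc.getInputStream))

    // Send a few records and read back the results.
    for (i <- 1 to 3) {
      toPython.println(i)
      println(s"python returned: ${fromPython.readLine()}")
    }
    toPython.close()
    proc.waitFor()
  }
}
```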

It looks like the latter case is what you are looking for. Here are some links into Spark's Scala core that could be useful to get started:

  • The Pyrolite library, which provides a Java interface to Python's pickle protocol - Spark uses it to serialize Java objects into pickle format. Such a conversion is required, for example, to access the key part of the (key, value) pairs in a PairRDD; see the sketch after this list.

  • Scala code that starts the Python process and interacts with it: api/python/PythonRDD.scala

  • SerDe utils that do the pickling of the data: api/python/SerDeUtil.scala

  • Python side: python/pyspark/worker.py
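As a small sketch of what the Pyrolite piece does (the record used here is made up, and Spark's real wrapper lives in SerDeUtil.scala), you can pickle a JVM object into bytes that any CPython process can read back with pickle.loads, and unpickle the result on the JVM side again:

```scala
import net.razorvine.pickle.{Pickler, Unpickler}

// Minimal illustration of Pyrolite: turn a JVM object into Python pickle
// bytes and back. Spark uses this to hand data to its Python workers.
object PyroliteSketch {
  def main(args: Array[String]): Unit = {
    val pickler   = new Pickler()
    val unpickler = new Unpickler()

    // A (key, value)-style record, as a Java map so Pyrolite can pickle it.
    val record = new java.util.HashMap[String, Int]()
    record.put("clicks", 42)

    val bytes: Array[Byte] = pickler.dumps(record) // JVM object -> pickle bytes
    val back = unpickler.loads(bytes)              // pickle bytes -> JVM object
    println(back)                                  // {clicks=42}
  }
}
```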




Answer 2:


So Spark uses py4j to communicate between the JVM and Python. This allows Spark to work with different versions of Python, but it requires serializing data between the JVM and Python in order to communicate. There is more info on py4j at http://py4j.sourceforge.net/ - hope that helps :)



Source: https://stackoverflow.com/questions/30684982/how-does-spark-interoperate-with-cpython
