PySpark serializing the 'self' referenced object in map lambdas?


Question


As far as I understand, while using the Spark Scala interface we have to be careful not to unnecessarily serialize a full object when only one or two attributes are needed: (http://erikerlandson.github.io/blog/2015/03/31/hygienic-closures-for-scala-function-serialization/)

How does this work when using PySpark? If I have a class as follows:

class C0(object):

  def func0(arg):
    ...

  def func1(rdd):
    result = rdd.map(lambda x: self.func0(x))

Does this result in pickling the full C0 instance? If yes, what's the correct way to avoid it?

Thanks.


Answer 1:


This does result in pickling of the full C0 instance, according to this documentation: http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark.

In order to avoid it, do something like:

class C0(object):

  def func0(self, arg): # added self
    ...

  def func1(self, rdd): # added self
    func = self.func0  # bind the method to a local name so the lambda captures it, not self
    result = rdd.map(lambda x: func(x))
    return result

Moral of the story: avoid the self keyword anywhere inside a map call. Spark can serialize just the function when it is captured in a local closure, but any reference to self forces Spark to pickle your entire object. The same trick works for plain data attributes, as in the sketch below.
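Copy the attribute's value into a local variable before the map, so the lambda closes over only that value; this mirrors the pattern in the linked Spark programming guide. A minimal sketch, where the class name, the factor attribute, and the local SparkContext setup are illustrative assumptions rather than part of the original question:

from pyspark import SparkContext

class C1(object):

  def __init__(self, factor):
    self.factor = factor  # plain data attribute

  def scale(self, rdd):
    factor = self.factor  # copy to a local so the lambda doesn't capture self
    return rdd.map(lambda x: x * factor)

sc = SparkContext("local", "closure-demo")
print(C1(3).scale(sc.parallelize([1, 2, 3])).collect())  # [3, 6, 9]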



Source: https://stackoverflow.com/questions/36508685/pyspark-serializing-the-self-referenced-object-in-map-lambdas
