Why is pyspark picking up a variable that was not broadcast?

Submitted by 安稳与你 on 2019-12-12 03:24:41

Question


I'm using pyspark to analyse a dataset, and I'm a little surprised that the following code works correctly even though it uses a variable that was not broadcast.

The variable in question is video, which is used in the filter function after the join.

seed = random.randint(0,999)

# df is a dataframe
# video is just one randomly sampled element
video = df.sample(False,0.001,seed).head()

# just a python list
otherVideos = [ (22,0.32),(213,0.43) ]

# transform the python list into an rdd 
resultsRdd = sc.parallelize(otherVideos)

rdd = df.rdd.map(lambda row: (row.video_id,row.title))

# perform a join between resultsRdd and rdd
# note that video.title was NOT broadcast
(resultsRdd
   .join(rdd)
   .filter(lambda pair: pair[1][1] != video.title) # HERE!!!
   .takeOrdered(10, key= lambda pair: -pair[1][0]))

I'm using pyspark in standalone mode, with the following arguments passed to spark-submit:

--num-executors 12 --executor-cores 4 --executor-memory 1g --master local[*]

Also, I'm running the code above in Jupyter (formerly IPython Notebook).


Answer 1:


[Reposting comment as an answer.]

For this concept, the "Understanding closures" section of the Spark programming guide is a pretty good read. Essentially, you do not need to broadcast a variable just because it is referenced from outside the scope of an RDD operation: the closure (which in your case includes video) is serialized and sent to each executor, so the value is available to every task during execution. Broadcast variables are useful when the shared dataset is large, because a broadcast value is shipped once and kept as a read-only cache on each executor, instead of being serialized, sent, and deserialized with every task that runs there.
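For comparison, here is a minimal sketch of the explicit-broadcast version of the same filter, reusing sc, resultsRdd, rdd, and video from the question (the name titleBroadcast is mine, not from the original code). With a single string like video.title the two versions behave identically; broadcasting only pays off once the shared value is large.

# ship video.title once per executor as a read-only cached value,
# instead of serializing it into the closure of every task
titleBroadcast = sc.broadcast(video.title)

(resultsRdd
   .join(rdd)
   .filter(lambda pair: pair[1][1] != titleBroadcast.value)  # read the cached copy
   .takeOrdered(10, key= lambda pair: -pair[1][0]))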



Source: https://stackoverflow.com/questions/33337446/why-is-pyspark-picking-up-a-variable-that-was-not-broadcast
