Why is pyspark picking up a variable that was not broadcast?

Submitted by 安稳与你 on 2019-12-12 03:24:41

Question


I'm using pyspark to analyse a dataset, and I'm a little surprised that the following code works correctly even though it uses a variable that was not broadcast.

The variable in question is video, which is used in the filter function after the join.

seed = random.randint(0,999)

# df is a dataframe
# video is just one randomly sampled element
video = df.sample(False,0.001,seed).head()

# just a python list
otherVideos = [ (22,0.32),(213,0.43) ]

# transform the python list into an rdd 
resultsRdd = sc.parallelize(otherVideos)

rdd = df.rdd.map(lambda row: (row.video_id,row.title))

# perform a join between resultsRdd and rdd
# note that video.title was NOT broadcast
(resultsRdd
   .join(rdd)
   .filter(lambda pair: pair[1][1] != video.title) # HERE!!!
   .takeOrdered(10, key= lambda pair: -pair[1][0]))

I'm using pyspark in standalone mode, with the following arguments passed to spark-submit:

--num-executors 12 --executor-cores 4 --executor-memory 1g --master local[*]

Also, I'm running the code above in Jupyter (formerly IPython Notebook).


Answer 1:


[Reposting comment as an answer.]

For this concept, the "Understanding closures" section of the Spark programming guide is a pretty good read. Essentially, you do not need to broadcast a variable just because it is referenced from outside the scope of an RDD operation: the closure (which in your case includes video) is serialized and sent to each executor, so the value is available to every task during execution. Broadcast variables are useful when the shared dataset is large, because a broadcast value is shipped once and kept as a read-only cache on each executor, instead of being serialized, sent, and deserialized with every task that runs there.
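For comparison, here is a minimal sketch of the explicit-broadcast version of the same filter, reusing sc, resultsRdd, rdd, and video from the question (the name titleBroadcast is mine, not from the original code). With a single string like video.title the two versions behave identically; broadcasting only pays off once the shared value is large.

# ship video.title once per executor as a read-only cached value,
# instead of serializing it into the closure of every task
titleBroadcast = sc.broadcast(video.title)

(resultsRdd
   .join(rdd)
   .filter(lambda pair: pair[1][1] != titleBroadcast.value)  # read the cached copy
   .takeOrdered(10, key= lambda pair: -pair[1][0]))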



Source: https://stackoverflow.com/questions/33337446/why-is-pyspark-picking-up-a-variable-that-was-not-broadcast
