Why is pyspark picking up a variable that was not broadcast?
**Question** I'm using pyspark to analyse a dataset, and I'm a little surprised that the following code works correctly even though it uses a variable that was never broadcast. The variable in question is `video`, which is used in the function passed to `filter` after the join.

```python
import random

seed = random.randint(0, 999)

# df is a dataframe
# video is just one randomly sampled element
video = df.sample(False, 0.001, seed).head()

# just a python list
otherVideos = [(22, 0.32), (213, 0.43)]

# transform the python list into an rdd
```
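To make what I mean reproducible outside Spark, here is a minimal pure-Python sketch (no Spark needed; the `make_filter` helper is hypothetical, just for illustration). It shows that a function carries the variables it references from an enclosing scope in its `__closure__`, which is presumably what lets the filter see `video` without any explicit broadcast:

```python
def make_filter(video):
    # `video` becomes a free variable captured in the
    # returned function's closure cell
    return lambda pair: pair[0] != video[0]

keep = make_filter((42, 0.9))

# the captured value travels with the function object itself
print(keep.__closure__[0].cell_contents)  # -> (42, 0.9)
print(keep((22, 0.32)))                   # -> True
print(keep((42, 0.9)))                    # -> False
```

My understanding is that when such a function is shipped to the executors, whatever is in its closure is serialized along with it, so is `broadcast` only an optimization for large values rather than a requirement?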