Question
I'm attempting to create broadcast variables from within Python methods (trying to abstract some utility methods I'm creating that rely on distributed operations). However, I can't seem to access the broadcast variables from within the Spark workers.
Let's say I have this setup:
def main():
    sc = SparkContext()
    SomeMethod(sc)

def SomeMethod(sc):
    someValue = rand()
    V = sc.broadcast(someValue)
    A = sc.parallelize().map(worker)

def worker(element):
    element *= V.value  ### NameError: global name 'V' is not defined ###
However, if I instead eliminate the SomeMethod() middleman, it works fine.
def main():
    sc = SparkContext()
    someValue = rand()
    V = sc.broadcast(someValue)
    A = sc.parallelize().map(worker)

def worker(element):
    element *= V.value  # works just fine
I'd rather not have to put all my Spark logic in the main method, if I can. Is there any way to broadcast variables from within local functions and have them be globally visible to the Spark workers?
Alternatively, what would be a good design pattern for this kind of situation? For example, I want to write a self-contained method specifically for Spark that performs a specific function I'd like to reuse.
Answer 1:
I am not sure I completely understood the question, but if you need the V object inside the worker function, then you should definitely pass it as a parameter; otherwise the method is not really self-contained:
def worker(V, element):
    element *= V.value
    return element
Now, in order to use it with map you need a partial, so that map only sees a one-parameter function:
from functools import partial

def SomeMethod(sc):
    someValue = rand()
    V = sc.broadcast(someValue)
    # Bind V positionally so the mapped function takes just the element
    A = sc.parallelize().map(partial(worker, V))
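For reference, a minimal end-to-end sketch of this pattern might look like the following. It assumes a local SparkContext, a small example list in place of the question's empty parallelize() call, and random.random() in place of the placeholder rand(); these are illustrative choices, not part of the original post.

from functools import partial
from random import random
from pyspark import SparkContext

def worker(V, element):
    # The broadcast handle is passed in explicitly, so no global lookup is needed
    return element * V.value

def SomeMethod(sc):
    someValue = random()
    V = sc.broadcast(someValue)
    # partial binds V positionally; map() then calls a one-argument function
    A = sc.parallelize([1, 2, 3]).map(partial(worker, V))
    print(A.collect())

def main():
    sc = SparkContext("local", "broadcast-example")  # illustrative local configuration
    SomeMethod(sc)

if __name__ == "__main__":
    main()

An equivalent alternative to partial is a lambda, e.g. .map(lambda element: worker(V, element)); either way, worker stays self-contained because the broadcast handle is an explicit argument rather than a global captured from an enclosing scope.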
Source: https://stackoverflow.com/questions/26959221/pyspark-broadcast-variables-from-local-functions