How to run a function on all Spark workers before processing data in PySpark?

南方客 2020-12-03 01:24

I'm running a Spark Streaming task in a cluster using YARN. Each node in the cluster runs multiple Spark workers. Before the streaming starts I want to execute a "setup" …

2 Answers
  •  死守一世寂寞
    2020-12-03 01:57

    If all you want is to distribute a file between worker machines, the simplest approach is to use the SparkFiles mechanism:

    some_path = ...  # local file, a file in DFS, an HTTP, HTTPS or FTP URI.
    sc.addFile(some_path)
    

    and retrieve it on the workers using SparkFiles.get and standard IO tools:

    import os

    from pyspark import SparkFiles

    # SparkFiles.get takes the file name, not the original path or URI
    with open(SparkFiles.get(os.path.basename(some_path))) as f:
        ...  # Do something
    

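    For example, here is a minimal sketch of reading the distributed file inside a task (the process_partition helper and the mapPartitions job are illustrative, not part of the original setup); it reuses sc and some_path from the snippet above:

    import os
    from pyspark import SparkFiles

    def process_partition(rows):
        # Each worker resolves its own local copy of the file shipped by addFile
        with open(SparkFiles.get(os.path.basename(some_path))) as f:
            payload = f.read()
        for row in rows:
            yield (row, len(payload))  # placeholder transformation

    rdd = sc.parallelize(range(8), 4)
    print(rdd.mapPartitions(process_partition).take(3))
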
    If you want to make sure that the model is actually loaded, the simplest approach is to load it on module import. Assuming config can be used to retrieve the model path:

    • model.py:

      from pyspark import SparkFiles
      
      config = ...
      class MyClassifier:
          clf = None
      
          @staticmethod
          def is_loaded():
              return MyClassifier.clf is not None
      
          @staticmethod
          def load_models(config):
              path = SparkFiles.get(config.get("model_file"))
              MyClassifier.clf = load_from_file(path)
      
      # Executed once per interpreter 
      MyClassifier.load_models(config)  
      
    • main.py:

      from pyspark import SparkContext
      
      config = ...
      
      sc = SparkContext("local", "foo")
      
      # Executed before StreamingContext starts
      sc.addFile(config.get("model_file"))
      sc.addPyFile("model.py")
      
      import model
      
      ssc = ...
      stream = ...
      stream.map(model.MyClassifier.do_something).pprint()
      
      ssc.start()
      ssc.awaitTermination()
      
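    As a quick sanity check (a sketch, not part of the original answer), you can run a throwaway job before starting the stream; it reuses the is_loaded helper from model.py and only proves the load path works on the executors that happen to run these tasks:

    # Force a few tasks so executors import model.py and run the module-level load
    num_slots = sc.defaultParallelism
    loaded = (sc.parallelize(range(num_slots), num_slots)
                .map(lambda _: model.MyClassifier.is_loaded())
                .collect())
    assert all(loaded), "model failed to load on at least one executor"
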
