Tips for properly using large broadcast variables?


I'm using a broadcast variable that is about 100 MB in size when pickled, which I'm approximating with:

>>> data = list(range(int(10*1e6)))
>>> import         
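The snippet above is cut off, but a minimal sketch of the kind of setup being described might look like the following (the names `bd` and the use of local mode are illustrative assumptions, not the asker's exact code):

>>> from pyspark import SparkContext
>>> sc = SparkContext("local", "broadcast-demo")
>>> data = list(range(int(10*1e6)))   # roughly 100 MB once pickled
>>> bd = sc.broadcast(data)
>>> # each task reads the broadcast value instead of capturing `data` in its closure
>>> rdd = sc.parallelize(range(4), 4).map(lambda i: len(bd.value))
>>> rdd.collect()
[10000000, 10000000, 10000000, 10000000]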


        
1 Answer

    Well, the devil is in the detail. To understand why this may happen, we'll have to take a closer look at PySpark's serializers. First, let's create a SparkContext with the default settings:

    from pyspark import SparkContext
    
    sc = SparkContext("local", "foo")
    

    and check what the default serializer is:

    sc.serializer
    ## AutoBatchedSerializer(PickleSerializer())
    
    sc.serializer.bestSize
    ## 65536
    

    It tells us three different things:

    • the default is an AutoBatchedSerializer
    • it uses PickleSerializer to do the actual work
    • the bestSize of a serialized batch is 65536 bytes

    A quick glance at the source code will show you that this serializer adjusts the number of records serialized at a time at runtime and tries to keep the batch size below 10 * bestSize. The important point is that not all records in a single partition are serialized at the same time.
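
    As a rough illustration only (this is a simplified sketch, not the actual PySpark implementation, and auto_batched_dump / best_size are made-up names), the adaptive batching idea looks roughly like this:

    import io
    import pickle

    def auto_batched_dump(iterator, stream, best_size=1 << 16):
        # Simplified sketch of the idea behind AutoBatchedSerializer.dump_stream:
        # pickle a chunk of records at a time, growing the chunk while the
        # serialized payload stays small and shrinking it when it gets too big.
        iterator = iter(iterator)
        batch = 1
        while True:
            chunk = [x for _, x in zip(range(batch), iterator)]
            if not chunk:
                break
            payload = pickle.dumps(chunk)
            stream.write(payload)
            if len(payload) < best_size:
                batch *= 2       # small payload -> try bigger batches
            elif len(payload) > best_size * 10 and batch > 1:
                batch //= 2      # payload too big -> back off

    auto_batched_dump(range(1000), io.BytesIO())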

    We can check that experimentally as follows:

    from operator import add
    
    bd = sc.broadcast({})
    
    rdd = sc.parallelize(range(10), 1).map(lambda _: bd.value)
    rdd.map(id).distinct().count()
    ## 1
    
    rdd.cache().count()
    ## 10
    
    rdd.map(id).distinct().count()
    ## 2
    

    As you can see, even in this simple example we get two distinct objects after serialization and deserialization. You can observe similar behavior working directly with pickle:

    import pickle

    v = {}
    vs = [v, v, v, v]
    
    v1, *_, v4 = pickle.loads(pickle.dumps(vs))
    v1 is v4
    ## True
    
    (v1_, v2_), (v3_, v4_) = (
        pickle.loads(pickle.dumps(vs[:2])),
        pickle.loads(pickle.dumps(vs[2:]))
    )
    
    v1_ is v4_
    ## False
    
    v3_ is v4_
    ## True
    

    After unpickling, values that were serialized in the same batch reference the same object, while values from different batches point to distinct objects.

    In practice, Spark uses multiple serializers and different serialization strategies. You can, for example, use batches of unlimited size:

    from pyspark.serializers import BatchedSerializer, PickleSerializer
    
    rdd_ = (sc.parallelize(range(10), 1).map(lambda _: bd.value)
        ._reserialize(BatchedSerializer(PickleSerializer())))
    rdd_.cache().count()
    
    rdd_.map(id).distinct().count()
    ## 1
    

    You can change the serializer by passing the serializer and/or batchSize parameters to the SparkContext constructor:

    sc = SparkContext(
        "local", "bar",
        serializer=PickleSerializer(),  # Default serializer
        # Unlimited batch size -> BatchedSerializer instead of AutoBatchedSerializer
        batchSize=-1  
    )
    
    sc.serializer
    ## BatchedSerializer(PickleSerializer(), -1)
    

    Choosing different serializers and batching strategies results in different trade-offs (speed, ability to serialize arbitrary objects, memory requirements, etc.).
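
    For instance, assuming any previously created SparkContext has been stopped, you could try MarshalSerializer, which is typically faster than pickle but only handles a limited set of built-in types (this is an illustrative sketch, not a recommendation):

    from pyspark import SparkContext
    from pyspark.serializers import MarshalSerializer

    # Faster for simple built-in types, but it cannot serialize arbitrary
    # Python objects the way PickleSerializer can.
    sc = SparkContext("local", "marshal-demo", serializer=MarshalSerializer())

    sc.parallelize(range(10)).map(lambda x: x * 2).sum()
    ## 90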

    You should also remember that broadcast variables in Spark are not shared between executor threads, so multiple deserialized copies can exist on the same worker at the same time.

    Moreover, you'll see similar behavior if you execute a transformation that requires shuffling.
