I'm using a broadcast variable about 100 MB pickled in size, which I'm approximating with:
>>> data = list(range(int(10*1e6)))
Well, the devil is in the detail. To understand why this may happen we'll have to take a closer look at the PySpark serializers. First, let's create a SparkContext
with default settings:
from pyspark import SparkContext
sc = SparkContext("local", "foo")
and check what the default serializer is:
sc.serializer
## AutoBatchedSerializer(PickleSerializer())
sc.serializer.bestSize
## 65536
It tells us three different things:

- the serializer is an AutoBatchedSerializer
- it uses PickleSerializer to perform the actual job
- the bestSize of the serialized batch is 65536 bytes

A quick glance at the source code will show you that this serializer adjusts the number of records serialized at a time at runtime, and tries to keep the batch size below 10 * bestSize. The important point is that not all records in a single partition are serialized at the same time.
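To make that concrete, here is a minimal sketch of the adaptive batching logic (my own illustration, not the actual AutoBatchedSerializer code; the function name and the use of plain pickle.dumps are assumptions for the example):

from itertools import islice
import pickle

def dump_stream_sketch(records, best_size=65536):
    # Illustrative stand-in for the adaptive batching described above.
    records, batch = iter(records), 1
    while True:
        vs = list(islice(records, batch))
        if not vs:
            break
        data = pickle.dumps(vs)  # the whole batch is pickled together
        yield data
        # Grow the batch while the output stays small,
        # shrink it when it exceeds 10 * best_size.
        if len(data) < best_size:
            batch *= 2
        elif len(data) > best_size * 10 and batch > 1:
            batch //= 2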
We can check that experimentally as follows:
bd = sc.broadcast({})
rdd = sc.parallelize(range(10), 1).map(lambda _: bd.value)
rdd.map(id).distinct().count()
## 1
rdd.cache().count()
## 10
rdd.map(id).distinct().count()
## 2
As you can see, even in this simple example, after serialization-deserialization we get two distinct objects. You can observe similar behavior working directly with pickle
:
import pickle

v = {}
vs = [v, v, v, v]
v1, *_, v4 = pickle.loads(pickle.dumps(vs))
v1 is v4
## True
(v1_, v2_), (v3_, v4_) = (
pickle.loads(pickle.dumps(vs[:2])),
pickle.loads(pickle.dumps(vs[2:]))
)
v1_ is v4_
## False
v3_ is v4_
## True
Values serialized in the same batch reference, after unpickling, the same object. Values from different batches point to different objects.
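One way to see why is to disassemble the pickle stream with the standard library's pickletools (the exact opcodes depend on the pickle protocol in use):

import pickle
import pickletools

v = {}
pickletools.dis(pickle.dumps([v, v]))
# The first occurrence of v is stored in the pickle memo and the second one
# is emitted as a memo lookup (a GET/BINGET opcode), so unpickling a single
# dump reproduces one shared object. Separate dumps have separate memos.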
In practice Spark provides multiple serializers and different serialization strategies. You can, for example, use batches of unlimited size:
from pyspark.serializers import BatchedSerializer, PickleSerializer
rdd_ = (sc.parallelize(range(10), 1).map(lambda _: bd.value)
        ._reserialize(BatchedSerializer(PickleSerializer())))
rdd_.cache().count()
rdd_.map(id).distinct().count()
## 1
You can change the serializer by passing the serializer and/or batchSize parameters to the SparkContext constructor:
sc.stop()  # only one SparkContext can be active, so stop the previous one first
sc = SparkContext(
"local", "bar",
serializer=PickleSerializer(), # Default serializer
# Unlimited batch size -> BatchedSerializer instead of AutoBatchedSerializer
batchSize=-1
)
sc.serializer
## BatchedSerializer(PickleSerializer(), -1)
Choosing different serializers and batching strategies results in different trade-offs (speed, ability to serialize arbitrary objects, memory requirements, etc.).
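For instance, if your records consist only of simple types, you could trade generality for speed with MarshalSerializer (a sketch; the "baz" app name and the batch size are arbitrary, and it assumes no other SparkContext is active):

from pyspark import SparkContext
from pyspark.serializers import MarshalSerializer

sc = SparkContext(
    "local", "baz",
    serializer=MarshalSerializer(),  # faster, but supports fewer Python types
    batchSize=1024  # fixed batch size -> BatchedSerializer(MarshalSerializer(), 1024)
)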
You should also remember that broadcast variables in Spark are not shared between executor threads, so multiple deserialized copies can exist on the same worker at the same time.
Moreover, you'll see similar behavior if you execute a transformation that requires shuffling.