Understanding Spark's caching

天命终不由人 2020-12-02 08:41

I'm trying to understand how Spark's caching works.

Here is my naive understanding; please let me know if I'm missing something:

I want to cache rdd2 and rdd3 since both will be used later, but I no longer need rdd1 once they are built.

Option A (unpersist rdd1 before any action runs):

    val rdd1 = sc.textFile("some data")
    rdd1.cache() //marks rdd1 as cached
    val rdd2 = rdd1.filter(...)
    val rdd3 = rdd1.map(...)
    rdd2.cache()
    rdd3.cache()
    rdd1.unpersist()

Option B (unpersist rdd1 only after the actions on rdd2 and rdd3 have run):

    val rdd1 = sc.textFile("some data")
    rdd1.cache() //marks rdd1 as cached
    val rdd2 = rdd1.filter(...)
    val rdd3 = rdd1.map(...)
    rdd2.cache()
    rdd3.cache()
    rdd2.saveAsTextFile("...")
    rdd3.saveAsTextFile("...")
    rdd1.unpersist()

Which option is correct?
3 answers
  • 2020-12-02 09:03

    It would seem that Option B is required. The reason is related to how persist/cache and unpersist are executed by Spark. Since RDD transformations merely build DAG descriptions without execution, in Option A by the time you call unpersist, you still only have job descriptions and not a running execution.

    This is relevant because a cache or persist call just adds the RDD to a Map of RDDs that marked themselves to be persisted during job execution. However, unpersist directly tells the blockManager to evict the RDD from storage and removes the reference in the Map of persistent RDDs.

    (See the persist and unpersist functions in the Spark source code.)

    So you would need to call unpersist after Spark actually executed and stored the RDD with the block manager.

    The comments on the RDD.persist method hint at this as well.
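
    A minimal sketch of that ordering effect, assuming a Spark shell with a SparkContext named sc ("some data" is a placeholder path; getPersistentRDDs exposes the map of RDDs marked for persistence):

    val rdd = sc.textFile("some data").cache() // cache() only registers rdd; nothing is stored yet
    println(sc.getPersistentRDDs.size)         // 1 -- rdd sits in the map of RDDs to persist
    rdd.unpersist()                            // tells the block manager to evict, and deregisters rdd
    println(sc.getPersistentRDDs.size)         // 0 -- the cache marking is gone before any job ran
    rdd.count()                                // the job runs with no caching at all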

  • 2020-12-02 09:13

    Option B is the right approach, with one small tweak: use a less expensive action to materialize the caches. In your code, saveAsTextFile is an expensive operation; replace it with count.

    The idea is to evict the big rdd1 from storage once it is no longer relevant for further computation (i.e., after rdd2 and rdd3 have been materialized).

    Updated code:

    val rdd1 = sc.textFile("some data").cache()
    val rdd2 = rdd1.filter(...).cache() 
    val rdd3 = rdd1.map(...).cache()
    
    rdd2.count
    rdd3.count
    
    rdd1.unpersist()
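
    As a hypothetical continuation (the output paths are placeholders): once the two counts have populated the caches, later actions read rdd2's and rdd3's own cached blocks and no longer need rdd1 at all.

    rdd2.saveAsTextFile("out2") // served from rdd2's cached blocks; rdd1 is not recomputed
    rdd3.saveAsTextFile("out3") // likewise served from rdd3's cache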
    
  • 2020-12-02 09:15

    In Option A, you have not shown when you are calling the action (the call to save).

    val rdd1 = sc.textFile("some data")
    rdd1.cache() //marks rdd1 as cached
    val rdd2 = rdd1.filter(...)
    val rdd3 = rdd1.map(...)
    rdd2.cache()
    rdd3.cache()
    rdd1.unpersist()
    rdd2.saveAsTextFile("...")
    rdd3.saveAsTextFile("...")
    

    If the sequence is as above, Option A should use the cached version of rdd1 for computing both rdd2 and rdd3.
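
    One way to test this claim yourself (a sketch, assuming the snippet above was run in a Spark shell) is to check whether rdd1 is still registered for caching at the point the saves execute:

    println(rdd1.getStorageLevel)                   // StorageLevel.NONE once unpersist has run
    println(sc.getPersistentRDDs.contains(rdd1.id)) // false once rdd1 has been unpersisted

    If the storage level is already NONE when the saves run, rdd1 is recomputed rather than read from cache, which is the behavior the first answer describes.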
