Spark - repartition() vs coalesce()

误落风尘 2020-11-22 17:11
According to Learning Spark
Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of

      
      
        
14 answers

孤城傲影 (original poster) 2020-11-22 17:25

All the answers add great knowledge to this very frequently asked question.

So, following the tradition of this question's timeline, here are my 2 cents.

I found repartition to be faster than coalesce in one very specific case.

In my application, when the estimated number of output files is below a certain threshold, repartition is faster.

Here is what I mean:

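import org.apache.spark.sql.SaveMode

// numFiles is the estimated number of output files; 20 is an empirically chosen threshold
if (numFiles > 20)
    df.coalesce(numFiles).write.mode(SaveMode.Overwrite).parquet(dest)
else
    df.repartition(numFiles).write.mode(SaveMode.Overwrite).parquet(dest)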


In the snippet above, when there were fewer than 20 files, coalesce took forever to finish while repartition was much faster, hence the branch.
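
One plausible explanation for this (an assumption, it was not measured here) is that coalesce avoids a shuffle, so the whole upstream stage runs with only numFiles tasks, while repartition adds a shuffle and lets the upstream work keep its original parallelism. The sketch below just prints the two physical plans so the difference is visible. The object name, the helper physicalPlan, and the sample data (1,000,000 rows in 200 partitions) are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PlanComparison {
  // Hypothetical helper: the physical plan as text, roughly what explain() prints.
  def physicalPlan(df: DataFrame): String =
    df.queryExecution.executedPlan.toString

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("plan-comparison")
      .master("local[4]") // local mode, for illustration only
      .getOrCreate()

    // Made-up input: 1,000,000 rows spread over 200 partitions.
    val df = spark.range(0L, 1000000L, 1L, 200).toDF("id")

    // coalesce merges existing partitions without a shuffle,
    // so the plan should contain a Coalesce node and no Exchange.
    println(physicalPlan(df.coalesce(4)))

    // repartition redistributes rows with a full shuffle,
    // which appears in the plan as an Exchange node.
    println(physicalPlan(df.repartition(4)))

    spark.stop()
  }
}
```

If the Exchange node shows up only in the repartition plan, the stages before it still run with the original number of tasks, which may be why it finishes sooner when numFiles is small.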

Of course, this number (20) will depend on the number of workers and the amount of data.
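
Since the threshold varies per cluster and dataset, it may be easier to pass it in as a parameter than to hard-code it. This is only a minimal sketch of that idea: the object OutputPartitioning, the functions useRepartition and writeParquet, and the default of 20 are all hypothetical names and values, not part of any Spark API.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object OutputPartitioning {
  // Hypothetical rule mirroring the snippet above:
  // at or below the threshold use repartition, above it use coalesce.
  def useRepartition(numFiles: Int, threshold: Int): Boolean =
    numFiles <= threshold

  // Writes df as Parquet with roughly numFiles output files.
  // threshold should be tuned per cluster and data volume; 20 is just the value from the answer.
  def writeParquet(df: DataFrame, numFiles: Int, dest: String, threshold: Int = 20): Unit = {
    require(numFiles > 0, "numFiles must be positive")
    val shaped =
      if (useRepartition(numFiles, threshold)) df.repartition(numFiles) // full shuffle
      else df.coalesce(numFiles)                                        // no shuffle
    shaped.write.mode(SaveMode.Overwrite).parquet(dest)
  }
}
```

Usage would look like OutputPartitioning.writeParquet(df, numFiles, dest, threshold = 50). Keep in mind that coalesce never increases the partition count, so if numFiles is larger than the current number of partitions the coalesce branch leaves the DataFrame as it is.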

Hope that helps.