According to Learning Spark:

> Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of `repartition()` called `coalesce()` that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.
I would like to add to Justin's and Power's answers that:
`repartition` ignores the existing partitions and creates new ones through a full shuffle, so you can use it to fix data skew. You can specify partition keys to define the distribution. Data skew is one of the biggest problems in the 'big data' problem space.
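As a rough illustration, here is a minimal sketch of both forms of `repartition`; the `RepartitionExample` object, the local SparkSession, and the `userId`/`value` columns are hypothetical stand-ins, not something from the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object RepartitionExample extends App {
  val spark = SparkSession.builder()
    .appName("repartition-example")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical input: 1000 rows in whatever partition layout
  // the source happened to produce.
  val df = (1 to 1000).map(i => (i % 3, i)).toDF("userId", "value")

  // Round-robin full shuffle: existing partitions are ignored and
  // rows are spread evenly across 8 brand-new partitions.
  val even = df.repartition(8)

  // Hash-partition by a key column instead, so all rows with the
  // same "userId" land in the same partition.
  val byKey = df.repartition(8, col("userId"))

  println(even.rdd.getNumPartitions)  // 8
  println(byKey.rdd.getNumPartitions) // 8

  spark.stop()
}
```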
`coalesce` works with the existing partitions and merges a subset of them instead of performing a full shuffle. It can't fix data skew the way `repartition` does, so even though it is less expensive, it might not be the thing you need.
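And a matching sketch for `coalesce`, under the same made-up local setup (the `CoalesceExample` object and the column name are again hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object CoalesceExample extends App {
  val spark = SparkSession.builder()
    .appName("coalesce-example")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Start from 8 partitions (hypothetical setup for the demo).
  val df = (1 to 1000).toDF("value").repartition(8)

  // Merge the 8 existing partitions down to 2: whole partitions are
  // combined in place, so no full shuffle is triggered.
  val merged = df.coalesce(2)
  println(merged.rdd.getNumPartitions) // 2

  // coalesce on a DataFrame can only decrease the partition count;
  // asking for more than exist leaves the count unchanged.
  println(df.coalesce(16).rdd.getNumPartitions) // still 8

  spark.stop()
}
```

Because whole partitions are glued together rather than rows redistributed, any imbalance in the original partitions survives the merge, which is exactly why `coalesce` cannot repair skew.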