Methods taken into consideration (Spark 2.2.1):
DataFrame.repartition (the two implementations that take partitionExprs: Column*)
The only similarity between these two methods is their names. They are used for different things and have different mechanics, so you shouldn't compare them at all.
That being said, repartition shuffles data using:

- partitionExprs alone: it uses a hash partitioner on the columns used in the expression, with spark.sql.shuffle.partitions as the number of partitions.
- partitionExprs and numPartitions: it does the same as the previous one, but overrides spark.sql.shuffle.partitions.
- numPartitions alone: it just rearranges data using RoundRobinPartitioning.
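The two strategies can be sketched in plain Python. This is a minimal illustration, not Spark's implementation: the function names are hypothetical, and Python's built-in hash() stands in for Spark's Murmur3-based column hash.

```python
# Sketch of the two repartition strategies (illustrative only).

def hash_partition(rows, key, num_partitions):
    """Route each row to a partition by hashing its key columns,
    like repartition(partitionExprs): equal keys land together."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(key(row)) % num_partitions].append(row)
    return parts

def round_robin_partition(rows, num_partitions):
    """Spread rows evenly regardless of their values,
    like repartition(numPartitions) with RoundRobinPartitioning."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts

rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
by_key = hash_partition(rows, key=lambda r: r[0], num_partitions=2)
evenly = round_robin_partition(rows, num_partitions=2)
```

Note the trade-off the sketch makes visible: hash partitioning co-locates equal keys but can produce skewed partition sizes, while round-robin balances sizes but scatters keys.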
Is the order of column inputs relevant in the repartition method too?
It is. hash((x, y)) is in general not the same as hash((y, x)).
df = (spark.range(5, numPartitions=4).toDF("x")
    .selectExpr("cast(x as string)")
    .crossJoin(spark.range(5, numPartitions=4).toDF("y")))

df.repartition(4, "y", "x").rdd.glom().map(len).collect()
# [8, 6, 9, 2]

df.repartition(4, "x", "y").rdd.glom().map(len).collect()
# [6, 4, 3, 12]
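The same order sensitivity can be reproduced without Spark. The sketch below uses Python's built-in hash() as a stand-in for Spark's Murmur3-based column hash (an assumption for illustration; the actual values Spark computes are different):

```python
num_partitions = 4

# For most (x, y) pairs, hashing the tuple in the other order gives a
# different value, so the same row can land in a different partition.
pairs = [(x, y) for x in range(10) for y in range(10) if x != y]
flipped_differs = sum(
    hash((x, y)) % num_partitions != hash((y, x)) % num_partitions
    for x, y in pairs
)
print(f"{flipped_differs} of {len(pairs)} pairs change partition when flipped")
```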
Does each chunk extracted for parallel execution contain the same data as would have been in each group had we run a SQL query with GROUP BY on the same columns?
It depends on what exactly you are asking:

- GROUP BY with the same set of columns will result in the same logical distribution of keys over partitions.
- GROUP BY "sees" only the actual groups.

Related: How to define partitioning of DataFrame?