If we have a pandas DataFrame consisting of a column of categories and a column of values, we can remove the mean in each category by doing the following:

df["Values"] = df["Values"] - df.groupby("Category")["Values"].transform("mean")
As I understand it, each category requires a full scan of the DataFrame.
No it doesn't. DataFrame aggregations are performed using a logic similar to aggregateByKey; see DataFrame groupBy behaviour/optimization. The slower part is the join, which requires sorting/shuffling, but it still doesn't require a scan per group.
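To make that concrete, here is a rough RDD-level sketch (illustrative only, not the code path Spark actually runs) of how a per-category mean can be computed with aggregateByKey: partial (sum, count) pairs are merged map-side within each partition and then combined across partitions, so no group ever triggers a separate scan. The column names Category and Values are taken from the question.

# (category, (running_sum, running_count)) pairs, merged map-side first
sums_counts = df.rdd \
    .map(lambda row: (row["Category"], (row["Values"], 1))) \
    .aggregateByKey(
        (0.0, 0),
        lambda acc, v: (acc[0] + v[0], acc[1] + v[1]),  # fold a value into a partition-local partial
        lambda a, b: (a[0] + b[0], a[1] + b[1]))        # merge partials across partitions

means_rdd = sums_counts.mapValues(lambda sc: sc[0] / sc[1])  # (category, mean)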
If this is the exact code you use, it is slow because you don't provide a join expression, so Spark simply performs a Cartesian product. That is not only inefficient but also incorrect. You want something like this:
from pyspark.sql.functions import col

# compute the per-category means once, then join them back on the key
means = df.groupBy("Category").mean("Values").alias("means")
df.alias("df").join(means, col("df.Category") == col("means.Category"))
I think (but have not verified) that I can speed this up a great deal if I collect the result of the group-by/mean into a dictionary, and then use that dictionary in a UDF.
It is possible, although performance will vary on a case-by-case basis. A problem with using Python UDFs is that data has to be moved to and from Python. Still, it is definitely worth trying. You should consider using a broadcast variable for nameToMean, though.
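A minimal sketch of that approach, assuming a live SparkContext sc and the avg(Values) column name from above (nameToMean comes from the question; the other names are illustrative):

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# collect the per-category means to the driver as a plain dict
nameToMean = {row["Category"]: row["avg(Values)"] for row in means.collect()}

# ship one read-only copy of the dict to each executor
bcMeans = sc.broadcast(nameToMean)

# subtract the broadcast category mean inside a Python UDF
demean = udf(lambda cat, val: val - bcMeans.value[cat], DoubleType())

result = df.withColumn("Demeaned", demean(df["Category"], df["Values"]))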
Is there an idiomatic way to express this type of operation without sacrificing performance?
In PySpark 1.6 you can use the broadcast function:
df.alias("df").join(
broadcast(means), col("df.Category") == col("means.Category"))
but it is not available in versions <= 1.5.
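On 1.5 and earlier you may still get a broadcast join implicitly: Spark broadcasts one side of a join automatically when its estimated size falls below spark.sql.autoBroadcastJoinThreshold (10 MB by default), which can be tuned on the SQLContext, for example:

sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "10485760")  # in bytes; 10 MB is the default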