I have a spark DataFrame which is grouped by a column aggregated with a count:
df.groupBy("a").agg(count("a")).show
+---------+----------------+
|        a|        count(a)|
+---------+----------------+
Yes, count applied to a specific column does not count the null values. If you want to include the null values, use:
df.groupBy("a").agg(count("*")).show
What is the reason for this behaviour?
This behaviour is defined by the SQL-92 standard. In particular (emphasis mine):
Let T be the argument or argument source of a <set function specification>.
If COUNT(*) is specified, then the result is the cardinality of T.
Otherwise, let TX be the single-column table that is the result of applying the <value expression> to each row of T and eliminating null values.
If DISTINCT is specified, then let TXA be the result of eliminating redundant duplicate values from TX. Otherwise, let TXA be TX.
If COUNT is specified, then the result is the cardinality of TXA.
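The rules quoted above are not Spark-specific; any SQL-92-compliant engine behaves the same way. As a minimal sketch, here is the same semantics demonstrated with SQLite (table name and data are made up for the example):

```python
import sqlite3

# In-memory table with two non-null values and two nulls in column a.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a TEXT)")
con.executemany("INSERT INTO t VALUES (?)", [("x",), ("x",), (None,), (None,)])

count_star = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]
count_a = con.execute("SELECT COUNT(a) FROM t").fetchone()[0]
count_distinct = con.execute("SELECT COUNT(DISTINCT a) FROM t").fetchone()[0]

print(count_star)      # 4 -- cardinality of T
print(count_a)         # 2 -- null values eliminated
print(count_distinct)  # 1 -- nulls and duplicates eliminated
```

COUNT(*) returns the cardinality of the table, COUNT(a) drops the null rows first, and COUNT(DISTINCT a) additionally drops duplicates, exactly as the standard describes.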
The PySpark equivalent of pandas' value_counts(dropna=False):
from pyspark.sql import functions as f

# f.count('*') produces a column named count(1), hence the orderBy key
df.groupBy('a').agg(f.count('*')).orderBy('count(1)', ascending=False).show()
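For comparison, the pandas call mentioned above keeps the null group in the same way (toy data, assuming pandas is installed):

```python
import pandas as pd

# Two non-null values and two nulls, mirroring the COUNT(*) example.
s = pd.Series(["x", "x", None, None])

# dropna=False keeps NaN as its own group, like Spark's count("*") per group.
vc = s.value_counts(dropna=False)
print(vc)  # "x" and NaN each appear twice
```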