spark dataframe grouping does not count nulls

自闭症患者 2020-12-07 04:13

I have a Spark DataFrame which is grouped by a column and aggregated with a count:

df.groupBy("a").agg(count("a")).show

+---------+----------------+
|a        |count(a)        |
+---------+----------------+
3 Answers
  • 2020-12-07 05:06

    Yes, count applied to a specific column does not count null values. If you want to include null values, use:

    df.groupBy("a").agg(count("*")).show
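
    As a rough sketch (the sample data below is made up for illustration), the difference shows up on a DataFrame that actually contains nulls:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.count

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // hypothetical sample: two non-null values and one null in column "a"
    val df = Seq(Some("x"), Some("x"), None).toDF("a")

    // count("a") skips the null row while count("*") keeps it,
    // so the null group reports count(a) = 0 but count(1) = 1
    df.groupBy("a").agg(count("a"), count("*")).show()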
    
  • 2020-12-07 05:07

    What is the reason for this behaviour?

    SQL-92 standard. In particular:

    Let T be the argument or argument source of a <set function specification>.

    If COUNT(*) is specified, then the result is the cardinality of T.

    Otherwise, let TX be the single-column table that is the result of applying the <value expression> to each row of T and eliminating null values.

    If DISTINCT is specified, then let TXA be the result of eliminating redundant duplicate values from TX. Otherwise, let TXA be TX.

    If COUNT is specified, then the result is the cardinality of TXA.
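
    To make the T / TX / TXA distinction concrete, here is a small sketch (the table name and values are invented) that runs the three variants through Spark SQL:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // hypothetical T: four rows, one of them null in column a
    Seq(Some(1), Some(1), Some(2), None).toDF("a").createOrReplaceTempView("t")

    spark.sql("""
      SELECT count(*)          AS all_rows,      -- cardinality of T   -> 4
             count(a)          AS non_null_rows, -- cardinality of TX  -> 3
             count(DISTINCT a) AS distinct_a     -- cardinality of TXA -> 2
      FROM t
    """).show()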

  • 2020-12-07 05:07

    The PySpark equivalent of pandas' value_counts(dropna=False):

    from pyspark.sql import functions as f

    # count('*') also counts rows where 'a' is null; Spark names the result column 'count(1)'
    df.groupBy('a').agg(f.count('*')).orderBy('count(1)', ascending=False).show()
    