null value and countDistinct with spark dataframe

Question

I have a very simple dataframe

  df = spark.createDataFrame([(None,1,3),(2,1,3),(2,1,3)], ['a','b','c'])

  +----+---+---+
  |   a|  b|  c|
  +----+---+---+
  |null|  1|  3|
  |   2|  1|  3|
  |   2|  1|  3|
  +----+---+---+

When I apply countDistinct to this dataframe, I get different results depending on the method:

First method

  df.distinct().count()

2

This is the result I expect: the last two rows are identical, but the first one is distinct from the other two (because of the null value).

Second method

  import pyspark.sql.functions as F
  df.agg(F.countDistinct("a","b","c")).show()

1

The way F.countDistinct handles the null value is not intuitive to me.

Does this look like a bug or normal behavior to you? And if it is normal, how can I write something that outputs exactly the result of the first approach, but in the spirit of the second method?


Answer 1:


countDistinct works the same way as Hive count(DISTINCT expr[, expr]):

count(DISTINCT expr[, expr]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL.

The first row, which has a null in column a, is therefore not counted. This NULL-skipping behavior is standard for SQL aggregate functions.
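If you want the first method's result while staying in the aggregate style, one commonly used workaround (a sketch, not taken from the original answer) is to wrap the columns in a struct: a struct whose fields contain null is itself non-null, so countDistinct no longer skips that row:

  import pyspark.sql.functions as F

  # struct("a", "b", "c") is non-null even when field "a" is null,
  # so countDistinct counts the (null, 1, 3) row as its own value
  df.agg(F.countDistinct(F.struct("a", "b", "c"))).show()

This returns 2, matching df.distinct().count(). An alternative is to coalesce each column to a sentinel value before counting, but the struct form avoids having to invent a placeholder that cannot collide with real data.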



Source: https://stackoverflow.com/questions/40345117/null-value-and-countdistinct-with-spark-dataframe
