Spark DataFrame: count distinct values of every column

野的像风 2020-11-27 04:48

The question is pretty much in the title: Is there an efficient way to count the distinct values in every column in a DataFrame?

The describe method provides only the count, not the number of distinct values.

5 Answers
  •  天涯浪人 2020-11-27 05:40

    In PySpark you could do something like this, using countDistinct():

    from pyspark.sql.functions import col, countDistinct

    # one countDistinct aggregation per column, aliased back to the column name
    df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns))
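
    For example, on a small toy DataFrame (the data and column names below are invented purely for illustration), the aggregation returns a single row holding one distinct count per column:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, countDistinct

    spark = SparkSession.builder.getOrCreate()

    # illustrative data: two distinct letters, two distinct numbers
    df = spark.createDataFrame(
        [("a", 1), ("a", 2), ("b", 2)],
        ["letter", "number"],
    )

    df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).show()
    # +------+------+
    # |letter|number|
    # +------+------+
    # |     2|     2|
    # +------+------+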
    

    Similarly in Scala:

    import org.apache.spark.sql.functions.{col, countDistinct}

    // one countDistinct aggregation per column, aliased back to the column name
    df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)
    

    If you want to speed things up at a potential loss of accuracy, you could also use approx_count_distinct() (named approxCountDistinct() before its deprecation in Spark 2.1), which computes an approximate count instead of an exact one.
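
    As a rough sketch of that trade-off (assuming Spark 2.1+, where the function lives in pyspark.sql.functions as approx_count_distinct), the same per-column pattern applies; the optional rsd argument caps the relative standard deviation of the estimate, and the value 0.01 below is just an illustration:

    from pyspark.sql.functions import approx_count_distinct, col

    # approximate distinct count per column via HyperLogLog++;
    # smaller rsd means a more accurate (but more memory-hungry) estimate
    df.agg(*(approx_count_distinct(col(c), rsd=0.01).alias(c) for c in df.columns))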
