Efficient Count Distinct with Apache Spark

盖世英雄少女心 2021-01-31 14:46

100 million customers click 100 billion times on the pages of a few websites (let's say 100 websites), and the click stream is available to you as one large dataset.

Usi

8 answers
  •  慢半拍i (original poster)
    2021-01-31 15:26

    Spark 2.0 added ApproxCountDistinct to the DataFrame and SQL APIs:

    https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html

    https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#approxCountDistinct(org.apache.spark.sql.Column)
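
The first link above names HyperLogLog as the algorithm behind the approximate count. To give a feel for why it is cheap, here is a minimal pure-Python sketch of a simplified HyperLogLog. It is an illustration only, not Spark's actual implementation (Spark uses the HyperLogLog++ variant); the class name `HyperLogLog`, the choice of SHA-1 as the hash, the default `p=12`, and the `user-N` IDs are all assumptions made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog sketch (illustrative, not Spark's code)."""

    def __init__(self, p=12):
        self.p = p                     # number of bits used to pick a register
        self.m = 1 << p                # number of registers (4096 for p=12)
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        # Derive a 64-bit hash from the first 8 bytes of SHA-1 (an assumption;
        # any well-mixed 64-bit hash would do).
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                  # top p bits choose the register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Sketches built on different partitions combine by element-wise max,
        # which is what makes the approach friendly to distributed execution.
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        # Harmonic-mean based raw estimate.
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        # Small-range correction: fall back to linear counting.
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros > 0:
            estimate = self.m * math.log(self.m / zeros)
        return int(round(estimate))


# Hypothetical usage: two "partitions" of a click stream with overlapping users.
part_a = HyperLogLog()
part_b = HyperLogLog()
for i in range(0, 3000):
    part_a.add(f"user-{i}")
for i in range(2000, 5000):
    part_b.add(f"user-{i}")
print("approximate distinct users:", part_a.merge(part_b).count())
```

Each sketch occupies a fixed 4096 small registers no matter how many clicks it sees, and with `p=12` the expected relative error is roughly 1.04 / sqrt(4096), about 1.6%. In PySpark 2.1 and later the built-in equivalent is `pyspark.sql.functions.approx_count_distinct(col, rsd=...)`, where `rsd` is the maximum allowed relative standard deviation.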
