Efficient Count Distinct with Apache Spark

盖世英雄少女心 2021-01-31 14:46

100 million customers click 100 billion times on the pages of a few websites (let's say 100 websites), and the click stream is available to you as one large dataset.

Usi

8 answers
  •  没有蜡笔的小新
    2021-01-31 15:04

    Sim gave a great talk on the count distinct problem at Spark Summit Europe.

    The HyperLogLog (HLL) algorithm is the best fit for big count distinct computations that need to be updated incrementally.

    The ApproxCountDistinct functions that Tagar linked to aren't the best option here: they return only the final estimate and don't expose the underlying HLL sketch, so their results can't be reaggregated (Sim discusses this in his talk).

    This blog post explains how to use the spark-alchemy library to create HLL sketches that are reaggregatable. Fun stuff!
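
To make "reaggregatable" concrete, here is a minimal pure-Python sketch of the idea. It is a toy HyperLogLog written for illustration, not the spark-alchemy API and not Spark's internal implementation; the `HyperLogLog` class, its method names, and the two sample sites are all assumptions made up for this example. The point it shows is that two sketches built separately can be merged register by register, which a plain distinct-count number cannot.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HLL sketch (hypothetical helper): 2**p registers of max ranks."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Stable 64-bit hash taken from the first 8 bytes of a SHA-1 digest.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        idx = h >> width                    # first p bits select a register
        rest = h & ((1 << width) - 1)       # remaining bits
        rank = width - rest.bit_length() + 1  # position of the leftmost 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Re-aggregation: the union's sketch is the register-wise maximum.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)    # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)  # small-range (linear counting) correction
        return raw


# Two sites with overlapping visitors: 100,000 distinct users overall.
site_a = HyperLogLog()
for user_id in range(0, 60_000):
    site_a.add(user_id)

site_b = HyperLogLog()
for user_id in range(40_000, 100_000):
    site_b.add(user_id)

merged = site_a.merge(site_b)
print(round(site_a.estimate()), round(site_b.estimate()), round(merged.estimate()))
```

Summing the two per-site estimates would double count the 20,000 shared users, while merging the sketches does not. With 4,096 registers the expected relative error is roughly 1.6%, and each sketch stays the same small size no matter how many clicks it has absorbed, which is what makes storing per-site or per-day sketches and rolling them up later practical.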
