Efficient Count Distinct with Apache Spark

盖世英雄少女心 2021-01-31 14:46

100 million customers click 100 billion times on the pages of a few websites (let's say 100 websites), and the click stream is available to you as a large dataset.

Usi
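
Assuming (per the title) that the task is to count distinct customers efficiently in Spark, a minimal sketch on Spark 2.x or later might look as follows; the input path and the customer_id / website column names are hypothetical placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{approx_count_distinct, countDistinct}

    val spark = SparkSession.builder.appName("DistinctCustomers").getOrCreate()

    // One row per click; customer_id and website are assumed column names.
    val clicks = spark.read.parquet("/path/to/clickstream")

    // Exact count: shuffles every distinct (website, customer_id) pair.
    clicks.groupBy("website")
      .agg(countDistinct("customer_id").as("exact_uniques"))
      .show()

    // Approximate count via HyperLogLog++ with ~1% relative error:
    // only fixed-size sketches are shuffled, which is far cheaper at this scale.
    clicks.groupBy("website")
      .agg(approx_count_distinct("customer_id", 0.01).as("approx_uniques"))
      .show()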

8 Answers
  •  自闭症患者
    2021-01-31 15:16

    I noticed that the basic distinct function can be significantly faster when you run it on an RDD than on a DataFrame. For example:

        val df = sqlContext.load(...)
        df.distinct.count     // 0.8 s
        df.rdd.distinct.count // 0.2 s
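
    A related option at the RDD level is countApproxDistinct, which estimates the number of distinct elements with HyperLogLog instead of materializing the full distinct set; a small sketch, assuming the same df as above and that roughly 1% relative error is acceptable:

        // Estimate the number of distinct rows without computing distinct() first.
        // 0.01 is the target relative standard deviation of the estimate.
        val approxRows = df.rdd.countApproxDistinct(0.01)
        println(s"~$approxRows distinct rows")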
    
