Efficient Count Distinct with Apache Spark


盖世英雄少女心 2021-01-31 14:46

100 million customers click 100 billion times on the pages of a few websites (let's say 100 websites), and the click stream is available to you in a large dataset.

Usi

8 answers

自闭症患者 (original poster)

2021-01-31 15:16
I noticed that the basic distinct function can be significantly faster when you run it on an RDD than when you run it on a DataFrame. For example:
```
val df = sqlContext.load(...)  // a DataFrame; the load arguments are elided
df.distinct().count()          // 0.8 s
df.rdd.distinct().count()      // 0.2 s
```
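Sub-second timings like these are easy to skew with caching and JVM warm-up, so it may help to measure both paths on the same cached data. The following is only a minimal sketch of how that comparison could be set up: it assumes Spark 2.x or later (`SparkSession` instead of `sqlContext`) running in local mode, and the `DistinctTiming` object, the `timed` helper, and the synthetic `customer_id`/`site_id` columns are hypothetical stand-ins, not part of the original post.

```scala
import org.apache.spark.sql.SparkSession

object DistinctTiming {
  // Hypothetical helper: evaluates `body` once and returns (result, elapsed milliseconds).
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1e6)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark 2.x+ session; the original post used sqlContext.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("distinct-timing")
      .getOrCreate()

    // Synthetic stand-in for the click stream: 1,000,000 rows,
    // 1,000 distinct (customer_id, site_id) pairs.
    val df = spark.range(0L, 1000000L)
      .selectExpr("id % 1000 AS customer_id", "id % 10 AS site_id")
      .cache()
    df.count() // materialize the cache so neither timing includes data generation

    // DataFrame path: distinct is planned by the SQL optimizer as an aggregation.
    val (dfCount, dfMs) = timed(df.distinct().count())

    // RDD path: rows are converted to Row objects and deduplicated via a shuffle.
    val (rddCount, rddMs) = timed(df.rdd.distinct().count())

    println(f"DataFrame distinct: $dfCount rows in $dfMs%.1f ms")
    println(f"RDD distinct:       $rddCount rows in $rddMs%.1f ms")

    spark.stop()
  }
}
```

Both counts should agree (1,000 for this synthetic data). The timings themselves depend on the Spark version, data size, and cluster, so a single run on a small dataset does not necessarily predict which path is faster at the 100-billion-row scale described in the question.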