100 million customers click 100 billion times on the pages of a few web sites (let\'s say 100 websites). And the click stream is available to you in a large dataset.
Usi
I noticed the basic distinct function can be significantly faster when you run it on a RDD than running it on a DataFrame collection. For example:
DataFrame df = sqlContext.load(...) df.distinct.count // 0.8 s df.rdd.distinct.count // 0.2 s