100 million customers click 100 billion times on the pages of a few websites (let's say 100 websites), and the click stream is available to you in a large dataset.
Usi
Sim gave a great talk about the count distinct problem at the Spark Summit in Europe.
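The HyperLogLog (HLL) algo is a strong fit for big count distinct computations that need to be updated incrementally: it summarizes each set in a small, fixed-size sketch, and sketches can be merged without rescanning the raw data.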
The ApproxCountDistinct algos that Tagar linked to aren't ideal here because they only return the final estimate: they don't expose the underlying HLL sketch, so the results can't be reaggregated (Sim discusses this in his talk).
This blog post explains how to use the spark-alchemy library to create HLL sketches that are reaggregatable. Fun stuff!
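
To make "reaggregatable" concrete, here is a toy HyperLogLog in plain Python. It is only a sketch of the idea, not the spark-alchemy API: the `HLL` class, the `sketch_of` helper, the SHA-1 based hash, and the precision `p=12` are all my own assumptions for illustration. The point is that two sketches built separately (say, one per day of clicks) can be merged by taking the register-wise max, and the merged sketch is identical to one built from all the data at once.

```python
import hashlib
import math


class HLL:
    """Toy HyperLogLog sketch (illustrative only, not a library API)."""

    def __init__(self, p=12):
        self.p = p                  # number of index bits
        self.m = 1 << p             # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a 64-bit hash from SHA-1 (an arbitrary choice for this sketch).
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                  # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Reaggregation: register-wise max of two sketches of equal precision.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        out = HLL(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)     # bias constant, valid for m >= 128
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            est = self.m * math.log(self.m / zeros)
        return est


def sketch_of(items, p=12):
    """Build a sketch from an iterable of ids (hypothetical helper)."""
    s = HLL(p)
    for item in items:
        s.add(item)
    return s


# Two "days" of customer ids that overlap on 20,000 customers.
day1 = sketch_of(range(0, 60_000))
day2 = sketch_of(range(40_000, 100_000))

# Merge the stored sketches instead of rescanning both days of clicks.
merged = day1.merge(day2)
print(round(merged.cardinality()))  # approximate count of distinct customers
```

With 4,096 registers the expected relative error is roughly 1.04 / sqrt(4096), so about 1.6%. Note that you can't get the merged answer from the two per-day estimates alone, since adding them double counts the overlap. That is why keeping the sketch around matters.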