Efficient Count Distinct with Apache Spark

﹥>﹥吖頭↗ submitted on 2019-12-03 05:30:34

Question


Suppose 100 million customers click 100 billion times on the pages of a few websites (let's say 100 websites), and the click stream is available to you as a large dataset.

Using the abstractions of Apache Spark, what is the most efficient way to count distinct visitors per website?


Answer 1:


visitors.distinct().count() would be the obvious way. With distinct you can specify the level of parallelism (the number of partitions), and tuning it can improve the speed. If it is possible to set up visitors as a stream and use DStreams, you could do the count in real time. You can stream directly from a directory and use much the same methods as on an RDD, like:

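val file = ssc.textFileStream("...")
// DStream has no distinct() of its own, so apply it to each batch RDD via transform;
// count() then yields one distinct count per batch.
file.transform(_.distinct()).count()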

The last option is to use def countApproxDistinct(relativeSD: Double = 0.05): Long. It is labelled as experimental, but it can be significantly faster than an exact count, especially if you allow a higher relativeSD (relative standard deviation).
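As a rough illustration of the two whole-RDD options above, here is a minimal sketch that runs both on a small in-memory dataset. The object name, the local master setting, the partition count of 8, and the sample visitor IDs are all assumptions made up for this example; only distinct(numPartitions), count(), and countApproxDistinct(relativeSD) are Spark's RDD API.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DistinctVisitorsSketch {
  // Returns (exact, approximate) distinct counts for a collection of visitor IDs.
  def counts(sc: SparkContext, ids: Seq[String]): (Long, Long) = {
    val visitors = sc.parallelize(ids)
    // distinct(numPartitions) sets how many partitions the shuffle uses (8 is arbitrary here).
    val exact = visitors.distinct(8).count()
    // Estimate with a target relative standard deviation of 0.05 (the default).
    val approx = visitors.countApproxDistinct(0.05)
    (exact, approx)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for running the sketch on one machine.
    val conf = new SparkConf().setMaster("local[2]").setAppName("distinct-visitors-sketch")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical data: 10,000 clicks from 1,000 distinct visitors.
      val ids = (1 to 10000).map(i => s"visitor-${i % 1000}")
      val (exact, approx) = counts(sc, ids)
      println(s"exact=$exact approx=$approx")
    } finally sc.stop()
  }
}
```

The approximate result should land close to the exact one, but it is an estimate, so it will generally not match exactly.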

EDIT: Since you want the count per website, you can just reduce on the website ID. This can be done efficiently (with combiners) since counting is an aggregation. If you have an RDD of (website name, user ID) tuples, you can call visitors.countApproxDistinctByKey(); once again, the approximate one is experimental, and to use it you need a pair RDD. Note that, as far as I know, the RDD API has no built-in exact countDistinctByKey(); for exact per-website counts, deduplicate the pairs with distinct() first and then count per key.

An interesting side note: if you are OK with approximations and want fast results, you might want to look into BlinkDB, made by the same people as Spark (the AMPLab).




Answer 2:


I've had to do similar things. One efficiency trick (that isn't really Spark-specific) is to map your visitor IDs to lists of bytes rather than GUID Strings. You can save 4x space that way, since 2 Chars hex-encode a single byte, and a Char uses 2 bytes in a String.

import org.apache.spark.rdd.RDD

// Inventing these custom types purely for this question - don't do this in real life!
type VisitorID = List[Byte]
type WebsiteID = Int

val visitors: RDD[(WebsiteID, VisitorID)] = ???

// Deduplicate (website, visitor) pairs, then count the remaining pairs per website
visitors.distinct().mapValues(_ => 1).reduceByKey(_ + _)

Note you could also do:

visitors.distinct().map(_._1).countByValue()

but this doesn't scale as well, since countByValue() returns the counts to the driver as a local Map.
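The GUID-to-bytes mapping mentioned in this answer could look something like the sketch below. It assumes the visitor IDs are standard textual UUIDs that java.util.UUID can parse; the object and method names are invented for illustration. A 32-hex-digit GUID packs into 16 bytes, which is where the roughly 4x saving over 2-byte Chars comes from.

```scala
import java.nio.ByteBuffer
import java.util.UUID

object GuidBytes {
  // Packs a textual GUID such as "123e4567-e89b-12d3-a456-426614174000" into 16 raw bytes.
  def toBytes(guid: String): List[Byte] = {
    val uuid = UUID.fromString(guid)
    ByteBuffer.allocate(16)
      .putLong(uuid.getMostSignificantBits())
      .putLong(uuid.getLeastSignificantBits())
      .array()
      .toList
  }

  // Inverse conversion, for when the original string form is needed again.
  def fromBytes(bytes: List[Byte]): String = {
    val buf = ByteBuffer.wrap(bytes.toArray)
    new UUID(buf.getLong(), buf.getLong()).toString
  }
}
```

One reason to prefer a List[Byte] (or another value with structural equality) over a raw Array[Byte] here is that JVM arrays compare by reference, so distinct() would not treat two arrays with equal contents as the same visitor.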




Answer 3:


I noticed that the basic distinct function can be significantly faster when you run it on an RDD than when you run it on a DataFrame. For example:

val df = sqlContext.load(...)
df.distinct.count     // 0.8 s
df.rdd.distinct.count // 0.2 s



Answer 4:


If data is an RDD of (site, visitor) pairs, then data.countApproxDistinctByKey(0.05) will give you an RDD of (site, count). The parameter is the relative standard deviation of the estimate; it can be reduced to get more accuracy at the cost of more processing.
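Here is a minimal sketch of that call on a tiny local dataset. The object name, the local master, and the sample data (three sites with 100, 200, and 300 distinct visitors) are assumptions invented for the example; countApproxDistinctByKey(relativeSD) is the only part taken from the answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ApproxVisitorsPerSite {
  // Estimates distinct visitors per site; a smaller relativeSD gives a tighter estimate.
  def approxPerSite(data: RDD[(Int, String)], relativeSD: Double = 0.05): Map[Int, Long] =
    data.countApproxDistinctByKey(relativeSD).collect().toMap

  // Hypothetical sample: site s has (s + 1) * 100 distinct visitors, each clicking 5 times.
  def sampleClicks(sc: SparkContext): RDD[(Int, String)] =
    sc.parallelize(for {
      site <- 0 until 3
      visitor <- 0 until (site + 1) * 100
      _ <- 0 until 5
    } yield (site, s"v$visitor"))

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("approx-visitors-per-site"))
    try {
      approxPerSite(sampleClicks(sc)).toSeq.sortBy(_._1).foreach {
        case (site, count) => println(s"site=$site approxVisitors=$count")
      }
    } finally sc.stop()
  }
}
```

Collecting to a Map is reasonable here only because the number of sites is small (about 100 in the question); with many keys you would keep the result as an RDD.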




Answer 5:


Spark 2.0 added approximate distinct counting (approxCountDistinct) to the DataFrame and SQL APIs:

https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html

https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#approxCountDistinct(org.apache.spark.sql.Column)
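A possible DataFrame version is sketched below. It assumes Spark 2.1 or later, where the function is exposed as approx_count_distinct (the 1.6 documentation linked above lists it under the older name approxCountDistinct). The column names site and visitor, the object name, and the sample data are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{approx_count_distinct, col}

object ApproxVisitorsDataFrame {
  // Groups clicks by site and estimates the distinct visitors in each group.
  // rsd is the maximum relative standard deviation allowed for the estimate.
  def approxPerSite(clicks: DataFrame, rsd: Double): DataFrame =
    clicks.groupBy("site")
      .agg(approx_count_distinct(col("visitor"), rsd).as("approx_visitors"))

  // Hypothetical sample: site s has (s + 1) * 100 distinct visitors, each clicking 5 times.
  def sampleClicks(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val rows = for {
      site <- 0 until 3
      visitor <- 0 until (site + 1) * 100
      _ <- 0 until 5
    } yield (site, s"v$visitor")
    rows.toDF("site", "visitor")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("approx-visitors-dataframe").getOrCreate()
    try approxPerSite(sampleClicks(spark), 0.05).orderBy("site").show()
    finally spark.stop()
  }
}
```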




Answer 6:


If you want it per webpage, then visitors.distinct()... is inefficient. If there are a lot of visitors and a lot of webpages, then you're deduplicating a huge number of (webpage, visitor) combinations, which can overwhelm the memory.

Here is another way:

visitors.groupByKey().map {
  case (webpage, visitorIterable) =>
    // Deduplicate each webpage's visitors in memory, then count them
    (webpage, visitorIterable.toSet.size)
}

This requires that the visitors to a single webpage fit in memory, so it may not be best in all cases.
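A variation on the same idea, which is not what this answer proposes but a swapped-in alternative, is to build the per-webpage sets with aggregateByKey so that duplicates are already dropped within each partition before the shuffle. The sketch below assumes Int webpage IDs and String visitor IDs; the object and method names are made up. It has the same limitation: each webpage's set of distinct visitors must fit in memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DistinctPerSiteWithSets {
  // Exact distinct visitors per webpage, merging per-partition sets instead of grouping raw clicks.
  def distinctPerSite(visitors: RDD[(Int, String)]): RDD[(Int, Int)] =
    visitors
      .aggregateByKey(Set.empty[String])(
        (seen, visitor) => seen + visitor, // add one visitor within a partition
        (left, right) => left ++ right)    // merge the sets from different partitions
      .mapValues(_.size)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("distinct-per-site-sets"))
    try {
      // Hypothetical clicks: webpage 1 has two distinct visitors, webpage 2 has one.
      val clicks = sc.parallelize(Seq((1, "a"), (1, "a"), (1, "b"), (2, "a"), (2, "a")))
      distinctPerSite(clicks).collect().sortBy(_._1).foreach(println)
    } finally sc.stop()
  }
}
```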



Source: https://stackoverflow.com/questions/24312113/efficient-count-distinct-with-apache-spark
