How to compute percentiles in Apache Spark

Asked by 遥遥无期 on 2020-12-04 22:08

I have an RDD of integers (i.e. RDD[Int]), and I would like to compute the following percentiles: [0th, 10th, 20th, ..., 90th, 100th]

10 Answers
  •  一整个雨季 · 2020-12-04 22:49

    How about t-digest?

    https://github.com/tdunning/t-digest

    A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. The t-digest algorithm is also very parallel-friendly, making it useful in map-reduce and parallel streaming applications.

    The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a data structure that is related to the Q-digest. This t-digest data structure can be used to estimate quantiles or compute other rank statistics. The advantage of the t-digest over the Q-digest is that the t-digest can handle floating point values while the Q-digest is limited to integers. With small changes, the t-digest can handle any values from any ordered set that has something akin to a mean. The accuracy of quantile estimates produced by t-digests can be orders of magnitude more accurate than those produced by Q-digests in spite of the fact that t-digests are more compact when stored on disk.

    In summary, the particularly interesting characteristics of the t-digest are that it

    • has smaller summaries than Q-digest
    • works on doubles as well as integers.
    • provides part per million accuracy for extreme quantiles and typically <1000 ppm accuracy for middle quantiles
    • is fast
    • is very simple
    • has a reference implementation that has > 90% test coverage
    • can be used with map-reduce very easily because digests can be merged (see the sketch just below)
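
    To make the merge property concrete, here is a minimal local sketch. It assumes the com.tdunning:t-digest artifact is on the classpath and uses its createMergingDigest, add, and quantile methods; merging via add(TDigest) is how partial summaries would be combined:

        import com.tdunning.math.stats.TDigest

        // Two digests built independently, e.g. on different workers.
        // A compression of 100 is the commonly cited default; higher means
        // more accuracy and a larger summary.
        val d1 = TDigest.createMergingDigest(100.0)
        val d2 = TDigest.createMergingDigest(100.0)
        (1 to 500).foreach(x => d1.add(x.toDouble))
        (501 to 1000).foreach(x => d2.add(x.toDouble))

        // Digests can be merged, which is what makes the structure
        // map-reduce friendly.
        d1.add(d2)
        println(d1.quantile(0.5))   // approximate median, ~500
        println(d1.quantile(0.9))   // approximate 90th percentile, ~900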

    It should be fairly easy to use the reference Java implementation from Spark.
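
    Here is a sketch of what the Spark side could look like, assuming the same library and that the digest class is serializable (approxPercentiles is just an illustrative name): build one digest per partition with treeAggregate, merge the partial digests, and query the merged digest for the desired quantiles.

        import com.tdunning.math.stats.TDigest
        import org.apache.spark.rdd.RDD

        def approxPercentiles(rdd: RDD[Int],
                              probabilities: Seq[Double],
                              compression: Double = 100.0): Seq[Double] = {
          val digest = rdd.treeAggregate(TDigest.createMergingDigest(compression))(
            (d, x) => { d.add(x.toDouble); d },   // fold each value into the partition-local digest
            (d1, d2) => { d1.add(d2); d1 }        // merge partial digests across partitions
          )
          probabilities.map(p => digest.quantile(p))
        }

        // The cut points from the question: 0th, 10th, ..., 100th percentile.
        // val cuts = approxPercentiles(myRdd, (0 to 100 by 10).map(_ / 100.0))

    treeAggregate is used instead of aggregate so that partial digests are combined on the executors in a tree pattern rather than all at once on the driver, which matters when the RDD has many partitions.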
