How to compute percentiles in Apache Spark

Asked 2020-12-04 22:08 · 10 answers · 515 views

I have an RDD of integers (i.e. RDD[Int]), and I would like to compute the following eleven percentiles: [0th, 10th, 20th, ..., 90th, 100th].

10 Answers
  • 2020-12-04 22:58

    If the target percentage N is small, say 10% or 20%, then I would do the following:

    1. Compute the size of the dataset with rdd.count() (skip this if you already know it and can take it as an argument).

    2. Rather than sorting the whole dataset, find the top(N) of each partition, where N is N% of rdd.count(). Sort each partition and take its top N elements; you now have a much smaller dataset to sort.

    3. rdd.sortBy

    4. zipWithIndex

    5. filter (index < topN)
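    The steps above can be sketched roughly as follows (a sketch, not a tested implementation: it assumes an existing `rdd: RDD[Int]` and an illustrative target percentage `p`):

    ```scala
    import org.apache.spark.rdd.RDD

    val p = 10                                   // illustrative: top 10%
    val count = rdd.count()                      // step 1: dataset size
    val n = math.max(1, (count * p / 100).toInt) // N = p% of the count

    // Step 2: keep only the top N of each partition, so the final sort is
    // over at most (numPartitions * N) elements instead of the whole dataset.
    val candidates: RDD[Int] = rdd.mapPartitions { it =>
      it.toArray.sorted(Ordering[Int].reverse).take(n).iterator
    }

    // Steps 3-5: sort the reduced dataset descending, index it, keep the first N.
    val topN: RDD[Int] = candidates
      .sortBy(x => -x)
      .zipWithIndex()
      .filter { case (_, idx) => idx < n }
      .map(_._1)
    ```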

  • 2020-12-04 22:59

    Here is my easy approach:

    val percentiles = Array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1)
    val accuracy = 1000000
    df.stat.approxQuantile("score", percentiles, 1.0/accuracy)
    

    output:

    scala> df.stat.approxQuantile("score", percentiles, 1.0/accuracy)
    res88: Array[Double] = Array(0.011044141836464405, 0.02022990956902504, 0.0317261666059494, 0.04638145491480827, 0.06498630344867706, 0.0892181545495987, 0.12161539494991302, 0.16825592517852783, 0.24740923941135406, 0.9188197255134583)
    

    accuracy: The accuracy parameter (default: 10000) is a positive numeric literal that controls approximation accuracy at the cost of memory. A higher accuracy value yields a better approximation; 1.0/accuracy is the relative error of the approximation.
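    Since the question starts from an RDD[Int] rather than a DataFrame, a conversion step is needed first. A hedged sketch (it assumes a SparkSession named `spark`; the column name "score" is illustrative):

    ```scala
    import spark.implicits._

    // Convert the RDD[Int] to a single-column DataFrame so approxQuantile applies.
    val df = rdd.toDF("score")

    // Include 0.0 and 1.0 to cover the 0th and 100th percentiles from the question.
    val percentiles = Array(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)

    // Third argument is the relative error; 0.0 would force an exact
    // (and much more expensive) computation.
    val quantiles: Array[Double] = df.stat.approxQuantile("score", percentiles, 1.0 / 1000000)
    ```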

  • 2020-12-04 23:02

    Convert your RDD into an RDD of Double, and then use the .histogram(10) action. See the DoubleRDD ScalaDoc.
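    As a sketch (note that histogram returns evenly spaced bucket boundaries and counts, which approximate the distribution rather than giving exact percentiles):

    ```scala
    // RDD[Int] -> RDD[Double]; histogram is available via DoubleRDDFunctions.
    val doubles = rdd.map(_.toDouble)

    // 10 evenly spaced buckets: returns (bucket boundaries, counts per bucket).
    val (buckets, counts) = doubles.histogram(10)
    ```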

  • 2020-12-04 23:02

    Another alternative is to use top and last on an RDD of Double. For example:

        val percentile_99th_value = scores.top((count / 100).toInt).last

    This method is better suited to computing individual percentiles.
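    A hedged generalization of this one-liner (the helper name and signature are illustrative, not from the original answer): for the p-th percentile, take the top (100 − p)% of elements and read the smallest of them.

    ```scala
    import org.apache.spark.rdd.RDD

    // Sketch: top(n) returns the n largest elements in descending order,
    // so .last is the smallest of the top (100 - p)%, i.e. the p-th percentile cutoff.
    def percentile(scores: RDD[Double], p: Int): Double = {
      val n = math.max(1, (scores.count() * (100 - p) / 100).toInt)
      scores.top(n).last
    }
    ```

    Note that top collects n elements to the driver, so this only stays cheap when (100 − p)% of the dataset is small.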
