I have an RDD of integers (i.e. RDD[Int]) and I would like to compute the following eleven percentiles: [0th, 10th, 20th, ..., 90th, 100th].
If N percent is small, say 10% or 20%, I would do the following:
1. Compute the size of the dataset with rdd.count(). Skip this step if you already know the size and can pass it in as an argument.
2. Rather than sorting the whole dataset, take the top elements from each partition. To do that, first work out topN, the number of elements that make up N% of rdd.count(), then sort each partition and take top(topN) from it. This leaves a much smaller dataset to sort.
3. rdd.sortBy
4. zipWithIndex
5. filter (index < topN)
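The steps above can be sketched roughly as follows. This is a minimal sketch, not a tuned implementation: the object and method names (TopPercentSketch, topPercent) are made up for illustration, and each partition is materialized in memory before its largest values are kept.

```scala
import org.apache.spark.rdd.RDD

object TopPercentSketch {
  // Returns the largest `percent`% of the values, in descending order.
  // Hypothetical helper, written only to illustrate steps 1 to 5.
  def topPercent(rdd: RDD[Int], percent: Double): Array[Int] = {
    // Step 1: size of the dataset (skip if already known).
    val count = rdd.count()
    // Step 2: topN = N% of the count, then keep only each partition's topN largest values.
    val topN = math.ceil(count * percent / 100.0).toInt
    val candidates = rdd.mapPartitions { it =>
      // Assumption: a single partition fits in memory.
      it.toArray.sorted(Ordering[Int].reverse).take(topN).iterator
    }
    candidates
      .sortBy(x => x, ascending = false)      // Step 3: sort the reduced dataset
      .zipWithIndex()                         // Step 4: attach a global rank
      .filter { case (_, idx) => idx < topN } // Step 5: keep the first topN ranks
      .map(_._1)
      .collect()
  }
}
```

In spark-shell, TopPercentSketch.topPercent(sc.parallelize(1 to 100), 10.0) would return the ten largest values. RDD.top(topN) already keeps a bounded set per partition and merges the results on the driver, so it may be a shorter route when topN is small enough to collect.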
Here is my easy approach:
val percentiles = Array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1)
val accuracy = 1000000
df.stat.approxQuantile("score", percentiles, 1.0/accuracy)
output:
scala> df.stat.approxQuantile("score", percentiles, 1.0/accuracy)
res88: Array[Double] = Array(0.011044141836464405, 0.02022990956902504, 0.0317261666059494, 0.04638145491480827, 0.06498630344867706, 0.0892181545495987, 0.12161539494991302, 0.16825592517852783, 0.24740923941135406, 0.9188197255134583)
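accuracy: the third argument to approxQuantile is the relative error, passed here as 1.0/accuracy. The name follows the percentile_approx SQL function, whose documentation describes it this way: the accuracy parameter (default: 10000) is a positive numeric literal that controls approximation accuracy at the cost of memory. A higher value of accuracy yields better accuracy; 1.0/accuracy is the relative error of the approximation.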
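To apply this to the RDD[Int] from the question, the RDD has to become a DataFrame first. A minimal sketch, assuming a SparkSession is available; the names DecilesSketch and deciles are hypothetical, and the column name "score" is reused from the snippet above:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object DecilesSketch {
  // Returns the 0th, 10th, ..., 100th percentiles of an RDD[Int].
  // relativeError = 0.0 asks for exact quantiles, which can be expensive on large data.
  def deciles(spark: SparkSession, rdd: RDD[Int], relativeError: Double): Array[Double] = {
    import spark.implicits._
    val df = rdd.toDF("score")                          // single-column DataFrame
    val probabilities = (0 to 10).map(_ / 10.0).toArray // 0.0, 0.1, ..., 1.0
    df.stat.approxQuantile("score", probabilities, relativeError)
  }
}
```

The probabilities here start at 0.0, so the result has eleven entries, one per percentile listed in the question.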
Convert your RDD into an RDD of Double, and then use the .histogram(10) action. See the DoubleRDD ScalaDoc.
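A minimal sketch of that conversion and call; HistogramSketch and bucketCounts are hypothetical names used only for this example:

```scala
import org.apache.spark.rdd.RDD

object HistogramSketch {
  // Returns (bucket boundaries, count per bucket) for `buckets` equal-width buckets
  // spanning the minimum to the maximum of the data.
  def bucketCounts(rdd: RDD[Int], buckets: Int): (Array[Double], Array[Long]) =
    rdd.map(_.toDouble).histogram(buckets)
}
```

With 10 buckets the first array holds 11 boundaries and the second holds 10 counts. Note that the buckets have equal width, not equal counts, so the boundaries only coincide with percentile values when the data is spread evenly; otherwise the counts have to be accumulated to estimate where each percentile falls.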
Another alternative is to use top and last on an RDD of Double. For example, val percentile_99th_value = scores.top((count/100).toInt).last
This method is better suited to individual percentiles.
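The same idea, generalized to an arbitrary percentile p, might look like the sketch below. TopLastSketch and percentile are hypothetical names, and the rounding rule (integer division, at least one element) is an assumption rather than a standard percentile definition.

```scala
import org.apache.spark.rdd.RDD

object TopLastSketch {
  // Value at the p-th percentile (0 < p < 100), read as the smallest
  // of the top (100 - p)% of the values.
  def percentile(scores: RDD[Double], p: Int): Double = {
    val count = scores.count()
    // Number of elements in the top (100 - p)%, never fewer than one.
    val k = math.max(1L, count * (100 - p) / 100).toInt
    // top(k) returns the k largest values in descending order on the driver.
    scores.top(k).last
  }
}
```

Because top(k) collects k elements to the driver, this works best for high percentiles, where k stays small.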