How to compute percentiles in Apache Spark

遥遥无期 2020-12-04 22:08

I have an RDD of integers (i.e. RDD[Int]) and I would like to compute the following eleven percentiles: the 0th, 10th, 20th, ..., 90th, and 100th.

10 Answers
  •  旧时难觅i
    2020-12-04 22:58

    If the percentage N is small, such as 10% or 20%, I would do the following:

    1. Compute the size of the dataset with rdd.count(). Skip this step if you already know the size and can pass it in as an argument.

    2. Rather than sorting the whole dataset, find the top elements of each partition. To do that, compute topN, the number of elements that make up N% of rdd.count(), then sort each partition and take its top topN elements. This leaves a much smaller dataset to sort.

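    3. Sort the reduced dataset with rdd.sortBy.

    4. Attach a position to each element with zipWithIndex.

    5. Keep only the leading elements with filter (index < topN).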
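
The five steps above could be sketched in Scala roughly as follows. This is a minimal sketch, not the answerer's own code: the helper names topCount and topPercent are hypothetical, rounding topN up with ceil is an assumption, and each partition is sorted fully in memory for simplicity. Only standard RDD methods are used (count, mapPartitions, sortBy, zipWithIndex, filter, map).

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper: how many elements make up `percent` of `total`.
// Rounding up is an assumption; the original answer does not specify it.
def topCount(total: Long, percent: Double): Int =
  math.ceil(total * percent / 100.0).toInt

// Hypothetical helper: the largest `percent` percent of the values in `rdd`.
def topPercent(rdd: RDD[Int], percent: Double): RDD[Int] = {
  // Step 1: size of the dataset.
  val total = rdd.count()
  val topN = topCount(total, percent)

  // Step 2: keep at most topN of the largest values from each partition.
  // The global top topN values are guaranteed to be among these candidates.
  // Note: this materializes each partition in memory before sorting it.
  val candidates = rdd.mapPartitions { it =>
    it.toArray.sorted(Ordering[Int].reverse).take(topN).iterator
  }

  candidates
    .sortBy(x => x, ascending = false)       // Step 3: sort the reduced dataset
    .zipWithIndex()                          // Step 4: pair each value with its rank
    .filter { case (_, idx) => idx < topN }  // Step 5: keep ranks 0 until topN
    .map { case (value, _) => value }
}
```

Under these assumptions, the smallest value in topPercent(rdd, 10.0) approximates the 90th percentile. When topN is small enough for the result to fit on the driver, the built-in rdd.top(topN) returns the same largest values as a local array.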
