How to compute percentiles in Apache Spark

遥遥无期 2020-12-04 22:08

I have an RDD of integers (i.e. RDD[Int]) and I would like to compute the following eleven percentiles: [0th, 10th, 20th, ..., 90th, 100th].

10 answers
  •  遥遥无期
    2020-12-04 22:47

    I discovered this gist

    https://gist.github.com/felixcheung/92ae74bc349ea83a9e29

    that contains the following function:

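      import org.apache.spark.rdd.RDD

      /**
       * Compute a percentile from an unsorted Spark RDD.
       * @param data input data set of Long integers (must be non-empty)
       * @param tile percentile to compute (e.g. 85 for the 85th percentile)
       * @return value of the input data at the specified percentile
       */
      def computePercentile(data: RDD[Long], tile: Double): Double = {
        // NIST method; data must be sorted in ascending order
        val r = data.sortBy(x => x)
        val c = r.count()
        if (c == 1) r.first()
        else {
          // Fractional rank n = k + d, with integer part k and remainder d
          val n = (tile / 100d) * (c + 1d)
          val k = math.floor(n).toLong
          val d = n - k
          if (k <= 0) r.first() // rank below the first element: return the minimum
          else {
            // Key each value by its zero-based position in sorted order
            val index = r.zipWithIndex().map(_.swap)
            if (k >= c) {
              index.lookup(c - 1).head // rank past the last element: return the maximum
            } else {
              // Interpolate linearly between the k-th and (k+1)-th smallest values
              val lower = index.lookup(k - 1).head
              val upper = index.lookup(k).head
              lower + d * (upper - lower)
            }
          }
        }
      }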
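
    The function above handles one percentile per call, so asking for all eleven values in the question would sort the data eleven times. A minimal sketch of a batched variant follows, assuming the same NIST interpolation. The names Percentiles, nistRank and computePercentiles are hypothetical and do not come from the gist. It sorts once, fetches only the positions it needs in a single filter, and interpolates on the driver.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper object (not part of the gist): several percentiles
// from one sort, using the same NIST interpolation as computePercentile.
object Percentiles {

  /** Integer part k and fractional part d of n = (tile / 100) * (count + 1). */
  def nistRank(tile: Double, count: Long): (Long, Double) = {
    val n = (tile / 100d) * (count + 1d)
    val k = math.floor(n).toLong
    (k, n - k)
  }

  def computePercentiles(data: RDD[Long], tiles: Seq[Double]): Seq[Double] = {
    // Sort once and key every value by its zero-based position.
    val indexed = data.sortBy(x => x).zipWithIndex().map(_.swap).cache()
    val c = indexed.count()
    require(c > 0, "cannot compute percentiles of an empty RDD")

    val ranks = tiles.map(nistRank(_, c))

    // Positions k - 1 and k for every tile, clamped to the valid range [0, c - 1].
    val wanted: Set[Long] = ranks.flatMap { case (k, _) =>
      Seq(k - 1, k).map(i => math.min(math.max(i, 0L), c - 1))
    }.toSet

    // One pass over the data brings back at most two values per requested tile.
    val values = indexed.filter { case (i, _) => wanted.contains(i) }.collectAsMap()
    indexed.unpersist()

    ranks.map { case (k, d) =>
      if (k <= 0) values(0L).toDouble          // below the first element: minimum
      else if (k >= c) values(c - 1).toDouble  // past the last element: maximum
      else values(k - 1) + d * (values(k) - values(k - 1))
    }
  }
}
```

    For the RDD[Int] in the question, a call might look like Percentiles.computePercentiles(rdd.map(_.toLong), (0 to 100 by 10).map(_.toDouble)), where rdd stands for the asker's data. The result is in the same order as the requested tiles.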
    
