How can I calculate the exact median with Apache Spark?

醉酒成梦 2020-12-06 05:30

This page contains some statistics functions (mean, stdev, variance, etc.), but it does not include the median. How can I calculate the exact median?

2 answers
  •  北荒 (original poster)
    2020-12-06 05:54

    You need to sort the RDD and take the element in the middle, or the average of the two middle elements when the count is even. Here is an example with RDD[Int]:

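      import org.apache.spark.SparkContext._ // pair-RDD implicits; only required before Spark 1.3
      import org.apache.spark.rdd.RDD
    
      val rdd: RDD[Int] = ???
    
      // Sort, then key each value by its position so it can be looked up by index.
      // Cache the result because count() and lookup() both reuse it.
      val sorted = rdd.sortBy(identity).zipWithIndex().map {
        case (v, idx) => (idx, v)
      }.cache()
    
      val count = sorted.count()
    
      // Assumes a non-empty RDD; lookup(...).head throws on an empty one.
      val median: Double = if (count % 2 == 0) {
        val l = count / 2 - 1
        val r = l + 1
        // Convert to Double before adding to avoid Int overflow.
        (sorted.lookup(l).head.toDouble + sorted.lookup(r).head) / 2
      } else sorted.lookup(count / 2).head.toDouble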
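
The index rule in the answer above (middle element for an odd count, mean of the two middle elements for an even count) can be checked locally without a cluster. The following is only a sketch on a plain Scala `Seq`, not Spark code. The names `MedianSketch` and `exactMedian` are hypothetical and do not come from the answer, and returning `Option` for empty input is an added assumption.

```scala
object MedianSketch {
  // Hypothetical helper: applies the same index rule as the RDD version
  // to a local collection. Returns None for empty input (an assumption;
  // the RDD version would throw instead).
  def exactMedian(values: Seq[Int]): Option[Double] = {
    if (values.isEmpty) None
    else {
      val sorted = values.sorted
      val count = sorted.length
      if (count % 2 == 0) {
        val l = count / 2 - 1 // lower middle index
        val r = l + 1         // upper middle index
        // Convert to Double before adding so large Ints cannot overflow.
        Some((sorted(l).toDouble + sorted(r)) / 2)
      } else Some(sorted(count / 2).toDouble)
    }
  }
}
```

For example, `MedianSketch.exactMedian(Seq(4, 1, 3, 2))` gives `Some(2.5)`, and `MedianSketch.exactMedian(Seq(3, 1, 2))` gives `Some(2.0)`.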
    
