approxQuantile gives incorrect median in Spark (Scala)?

太阳男子 2020-12-19 09:53

I have this test data:

 val data = List(
        List(47.5335D),
        List(67.5335D),
        List(69.5335D),
        List(444.1235D),
        List(677.53         


        
3 answers
  • 2020-12-19 10:14

    This is the result on my local machine. Are you doing something similar?

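    // spark-shell already has these implicits in scope; a standalone app needs this import for toDF.
    import spark.implicits._

    val data = List(
      List(47.5335D),
      List(67.5335D),
      List(69.5335D),
      List(444.1235D),
      List(677.5335D)
    )

    // Flatten to a List[Double]; toDF names the single column "value".
    val df = data.flatten.toDF

    // The third argument is relativeError; 0 requests the exact quantile.
    df.stat.approxQuantile("value", Array(0.5), 0)
    // res18: Array[Double] = Array(67.5335)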
    
  • 2020-12-19 10:20

    Note that this is an approximate quantile computation. It is not guaranteed to give you the exact answer every time. See here for a more thorough explanation.

    The reason is that for very large datasets, sometimes you are OK with an approximate answer, as long as you get it significantly faster than the exact computation.
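To make the speed/accuracy trade-off concrete, here is a minimal sketch that calls `approxQuantile` on the same five values with two different `relativeError` settings. The object name, app name, and the local `SparkSession` setup are assumptions for illustration; in spark-shell the `spark` value already exists. According to the Spark API documentation, a `relativeError` of 0 asks for exact quantiles, which can be expensive on large data, while larger values allow the returned element's rank to deviate further from the target.

```scala
import org.apache.spark.sql.SparkSession

object ApproxQuantileDemo {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimentation only.
    val spark = SparkSession.builder()
      .appName("approx-quantile-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Same values as in the question, in a single column named "value".
    val df = Seq(47.5335, 67.5335, 69.5335, 444.1235, 677.5335).toDF("value")

    // relativeError = 0.0 requests exact quantiles.
    val strict = df.stat.approxQuantile("value", Array(0.5), 0.0)

    // A larger relativeError trades accuracy for speed and memory.
    val loose = df.stat.approxQuantile("value", Array(0.5), 0.1)

    println(s"relativeError=0.0 -> ${strict.mkString(", ")}")
    println(s"relativeError=0.1 -> ${loose.mkString(", ")}")

    spark.stop()
  }
}
```

The printed values are deliberately not listed here, since the answers in this thread suggest they can differ between Spark versions.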

  • 2020-12-19 10:31

    I encountered a similar problem when using the approxQuantile() method with Spark 2.2.1. After I upgraded to Spark 2.4.3, approxQuantile() returned the exact median.
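When comparing Spark versions like this, it can help to have an independent reference value. The following is a small sketch in plain Scala (no Spark) that computes the exact median of a local collection by sorting. The `ExactMedian` object and its `median` helper are hypothetical names introduced here, and the approach only suits data small enough to fit in memory.

```scala
object ExactMedian {
  // Exact median by sorting; intended for small local collections only.
  def median(values: Seq[Double]): Double = {
    require(values.nonEmpty, "median of an empty collection is undefined")
    val sorted = values.sorted
    val n = sorted.length
    if (n % 2 == 1) sorted(n / 2)                   // odd count: the middle element
    else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0  // even count: mean of the two middle elements
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(47.5335, 67.5335, 69.5335, 444.1235, 677.5335)
    println(median(data)) // prints 69.5335
  }
}
```

If `approxQuantile` with a relative error of 0 returns something other than this reference value, the Spark version is a reasonable first thing to check.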
