How to calculate Median in spark sqlContext for column of data type double

温柔的废话 2020-12-03 15:48

I have given the sample table. I want to get the median of the "value" column for each group in the "source" column, where the source column is of String DataType and the value column is of Double DataType.

3 Answers
  • 2020-12-03 16:25

    Here is how it can be done using Spark Scala DataFrame functions. This is based on how Imputer implements the median strategy in Spark >= 2.2: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala

      df.select(colName)
        .stat
        .approxQuantile(colName, Array(0.5), 0.001) // 0.5 = median; 0.001 = relative error
        .head
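
    Since the question asks for the median per "source" group, here is a minimal sketch of the grouped variant (assuming Spark >= 2.1, where percentile_approx is available as a built-in SQL expression):

      import org.apache.spark.sql.functions.expr

      // approximate median of "value" within each "source" group;
      // 0.5 is the requested quantile, i.e. the median
      df.groupBy("source")
        .agg(expr("percentile_approx(value, 0.5)").as("median"))
        .show()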
    
  • 2020-12-03 16:32

    For non-integral values you should use the percentile_approx Hive UDAF:

    import org.apache.spark.mllib.random.RandomRDDs
    import sqlContext.implicits._ // required for .toDF on an RDD

    val df = RandomRDDs.normalRDD(sc, 1000, 10, 1).map(Tuple1(_)).toDF("x")
    df.registerTempTable("df")
    sqlContext.sql("SELECT percentile_approx(x, 0.5) FROM df").show
    
    // +--------------------+
    // |                 _c0|
    // +--------------------+
    // |0.035379710486199915|
    // +--------------------+
    

    On a side note, you should use GROUP BY, not PARTITION BY. The latter is used for window functions and has a different effect than you expect.

    SELECT source, percentile_approx(value, 0.5) FROM df GROUP BY source
    

    See also How to find median using Spark
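
    Applied to the question's schema, a minimal sketch (the sample values are made up for illustration; assumes a HiveContext in Spark 1.x, or Spark >= 2.1 where percentile_approx is built in):

      // hypothetical sample data with the question's "source"/"value" columns
      val data = sqlContext.createDataFrame(Seq(
        ("a", 1.0), ("a", 2.0), ("a", 3.0),
        ("b", 10.0), ("b", 20.0)
      )).toDF("source", "value")
      data.registerTempTable("data")

      sqlContext.sql(
        "SELECT source, percentile_approx(value, 0.5) AS median FROM data GROUP BY source"
      ).show()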

  • 2020-12-03 16:37

    Have you tried the DataFrame.describe() method?

    https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrame.html#describe(java.lang.String...)

    Not sure it's exactly what you're looking for, but it might get you closer.
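
    For reference, describe returns count, mean, stddev, min, and max, but not the median, so you would still need percentile_approx or approxQuantile for that:

      // summary statistics for the "value" column: count, mean, stddev, min, max
      df.describe("value").show()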
