How to calculate Median in spark sqlContext for column of data type double

Asked 2020-12-03 15:48 by 温柔的废话

I have given a sample table. I want to get the median of the "value" column for each group in the "source" column, where the "source" column is of String DataType and the "value" column is of Double DataType.

3 Answers
  •  不知归路
    2020-12-03 16:32

    For non-integral values you should use the percentile_approx UDF:

    import org.apache.spark.mllib.random.RandomRDDs
    import sqlContext.implicits._  // needed for toDF on RDDs
    
    // 1000 standard-normal samples, 10 partitions, seed 1
    val df = RandomRDDs.normalRDD(sc, 1000, 10, 1).map(Tuple1(_)).toDF("x")
    df.registerTempTable("df")
    sqlContext.sql("SELECT percentile_approx(x, 0.5) FROM df").show
    
    // +--------------------+
    // |                 _c0|
    // +--------------------+
    // |0.035379710486199915|
    // +--------------------+
    

    On a side note, you should use GROUP BY, not PARTITION BY. The latter is used for window functions and has a different effect than you expect.

    SELECT source, percentile_approx(value, 0.5) FROM df GROUP BY source
    
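
    To see what that grouped query computes, here is a minimal plain-Scala sketch (no Spark) that takes an exact median per group; note that percentile_approx itself returns an approximation, and the sample data below is made up for illustration:

    ```scala
    // Sample (source, value) rows, analogous to the question's table.
    val rows = Seq(
      ("a", 1.0), ("a", 2.0), ("a", 9.0),
      ("b", 4.0), ("b", 6.0)
    )
    
    // Exact median: middle element for odd counts,
    // mean of the two middle elements for even counts.
    def median(xs: Seq[Double]): Double = {
      val sorted = xs.sorted
      val n = sorted.length
      if (n % 2 == 1) sorted(n / 2)
      else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
    }
    
    // GROUP BY source, then aggregate each group's values with median.
    val medians: Map[String, Double] =
      rows.groupBy(_._1).map { case (src, grp) => src -> median(grp.map(_._2)) }
    
    // medians: Map("a" -> 2.0, "b" -> 5.0)
    ```
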

    See also How to find median using Spark
