I have a sample table. I want to get the median of the "value" column for each group in the "source" column, where the source column is of String data type and the value column is of
Here is how it can be done using Spark Scala DataFrame functions. This is based on how Imputer implements the median strategy in Spark >= 2.2 - https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala:
df.select(colName)
.stat
.approxQuantile(colName, Array(0.5), 0.001) // median
.head
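To answer the grouped part of the question with this approach, one option is to run approxQuantile once per distinct group. A hypothetical sketch (column names "source" and "value" are assumed from the question; groupedMedians is a name I made up) could look like this - note it launches one Spark job per group, so it only makes sense for a small number of groups:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: approximate median of "value" for each distinct "source".
def groupedMedians(df: DataFrame): Map[String, Double] = {
  val sources = df.select("source").distinct.collect.map(_.getString(0))
  sources.map { s =>
    val median = df
      .filter(df("source") === s)       // restrict to one group
      .stat
      .approxQuantile("value", Array(0.5), 0.001) // 0.5 = median
      .head
    s -> median
  }.toMap
}
```

For many groups the single percentile_approx GROUP BY query shown below scales much better, since it computes all groups in one pass.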
For non-integral values you should use percentile_approx (a Hive UDAF):
import org.apache.spark.mllib.random.RandomRDDs
val df = RandomRDDs.normalRDD(sc, 1000, 10, 1).map(Tuple1(_)).toDF("x")
df.registerTempTable("df")
sqlContext.sql("SELECT percentile_approx(x, 0.5) FROM df").show
// +--------------------+
// | _c0|
// +--------------------+
// |0.035379710486199915|
// +--------------------+
On a side note, you should use GROUP BY, not PARTITION BY. The latter is used for window functions and has a different effect than you expect.
SELECT source, percentile_approx(value, 0.5) FROM df GROUP BY source
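If you prefer the DataFrame API over raw SQL, the same grouped query can be sketched via expr (column names assumed from the question; in older Spark versions percentile_approx is only reachable through a SQL expression, while newer releases also expose a built-in function for it):

```scala
import org.apache.spark.sql.functions.expr

// One pass over the data, one approximate median per source group.
val medians = df.groupBy("source")
  .agg(expr("percentile_approx(value, 0.5)").as("median"))
medians.show()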
See also How to find median using Spark
Have you tried the DataFrame.describe() method?
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrame.html#describe(java.lang.String...)
Not sure it's exactly what you're looking for, but it might get you closer.
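For what it's worth, describe() only reports count, mean, stddev, min, and max - no median and no grouping - so it gives summary statistics rather than a per-group median:

```scala
// Prints count/mean/stddev/min/max for the "value" column;
// it does not include a median and cannot group by "source".
df.describe("value").show()
```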