I have a sample table. I want to get the median of the "value" column for each group in the "source" column, where the source column is of String data type and the value column is of
Here is how it can be done using Spark Scala DataFrame functions. This is based on how Imputer implements the median strategy in Spark >= 2.2 - https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala:
df.select(colName)
.stat
.approxQuantile(colName, Array(0.5), 0.001) // median
.head
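To answer the grouped part of the question with this approach, one option is to run approxQuantile once per distinct group. A hypothetical sketch (column names "source" and "value" are assumed from the question; groupedMedians is a name I made up) could look like this - note it launches one Spark job per group, so it only makes sense for a small number of groups:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: approximate median of "value" for each distinct "source".
def groupedMedians(df: DataFrame): Map[String, Double] = {
  val sources = df.select("source").distinct.collect.map(_.getString(0))
  sources.map { s =>
    val median = df
      .filter(df("source") === s)       // restrict to one group
      .stat
      .approxQuantile("value", Array(0.5), 0.001) // 0.5 = median
      .head
    s -> median
  }.toMap
}
```

For many groups the single percentile_approx GROUP BY query shown below scales much better, since it computes all groups in one pass.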
For non-integral values you should use percentile_approx (a Hive UDAF):
import org.apache.spark.mllib.random.RandomRDDs
val df = RandomRDDs.normalRDD(sc, 1000, 10, 1).map(Tuple1(_)).toDF("x")
df.registerTempTable("df")
sqlContext.sql("SELECT percentile_approx(x, 0.5) FROM df").show
// +--------------------+
// | _c0|
// +--------------------+
// |0.035379710486199915|
// +--------------------+
On a side note, you should use GROUP BY, not PARTITION BY. The latter is used for window functions and has a different effect than you expect.
SELECT source, percentile_approx(value, 0.5) FROM df GROUP BY source
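If you prefer the DataFrame API over raw SQL, the same grouped query can be sketched via expr (column names assumed from the question; in older Spark versions percentile_approx is only reachable through a SQL expression, while newer releases also expose a built-in function for it):

```scala
import org.apache.spark.sql.functions.expr

// One pass over the data, one approximate median per source group.
val medians = df.groupBy("source")
  .agg(expr("percentile_approx(value, 0.5)").as("median"))
medians.show()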
See also How to find median using Spark
Have you tried the DataFrame.describe() method?
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrame.html#describe(java.lang.String...)
Not sure it's exactly what you're looking for, but it might get you closer.
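For what it's worth, describe() only reports count, mean, stddev, min, and max - no median and no grouping - so it gives summary statistics rather than a per-group median:

```scala
// Prints count/mean/stddev/min/max for the "value" column;
// it does not include a median and cannot group by "source".
df.describe("value").show()
```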