I would like to calculate group quantiles on a Spark dataframe (using PySpark). Either an approximate or exact result would be fine. I prefer a solution that I can use withi
Since you have access to percentile_approx, one simple solution would be to use it in a SQL command:
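from pyspark.sql import SQLContext

# Assumes an existing SparkContext `sc` and a DataFrame `df` with columns `grp` and `val`.
sqlContext = SQLContext(sc)
# Expose the DataFrame to SQL under the name "df".
# (registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is its replacement.)
df.registerTempTable("df")
# Approximate median of `val` within each group.
df2 = sqlContext.sql("select grp, percentile_approx(val, 0.5) as med_val from df group by grp")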
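If you need more than the median, the same query can be generated for any set of quantiles. The following is only a sketch: `build_quantile_query` and the `val_q50`-style column names are hypothetical helpers introduced for illustration, not part of Spark. It only builds the SQL string, so it assumes the table and column names come from trusted code (they are interpolated without escaping).

```python
def build_quantile_query(table, group_col, value_col, quantiles):
    """Build a Spark SQL string with one percentile_approx column per quantile.

    Hypothetical helper: the names are pasted into the query unescaped,
    so pass only identifiers you control.
    """
    for q in quantiles:
        if not 0.0 <= q <= 1.0:
            raise ValueError("quantile must be in [0, 1], got %r" % (q,))
    # One output column per quantile, e.g. val_q25, val_q50, val_q75.
    cols = ", ".join(
        "percentile_approx({v}, {q}) as {v}_q{p}".format(
            v=value_col, q=q, p=int(round(q * 100))
        )
        for q in quantiles
    )
    return "select {g}, {c} from {t} group by {g}".format(
        g=group_col, c=cols, t=table
    )


query = build_quantile_query("df", "grp", "val", [0.25, 0.5, 0.75])
```

With the temp table registered as above, the result would then be obtained with `df2 = sqlContext.sql(query)`.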