Median / quantiles within PySpark groupBy

感情败类 2020-12-04 15:26

I would like to calculate group quantiles on a Spark dataframe (using PySpark). Either an approximate or exact result would be fine. I prefer a solution that I can use withi

5 answers
  •  北海茫月
    2020-12-04 16:16

    Since you have access to percentile_approx, one simple solution would be to use it in a SQL command:

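    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)  # sc is an already-running SparkContext
    
    # Expose the DataFrame to SQL under the name "df"
    # (on Spark 2.0+, df.createOrReplaceTempView("df") replaces the deprecated registerTempTable)
    df.registerTempTable("df")
    # Approximate median (0.5 quantile) of val within each grp
    df2 = sqlContext.sql("select grp, percentile_approx(val, 0.5) as med_val from df group by grp")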
    
