Median / quantiles within PySpark groupBy

感情败类 2020-12-04 15:26

I would like to calculate group quantiles on a Spark dataframe (using PySpark). Either an approximate or exact result would be fine. I prefer a solution that I can use withi

5 answers
  •  萌比男神i
    2020-12-04 16:17

    One problem with "percentile_approx(val, 0.5)": for values such as [1,2,3,4] it returns 2 as the median, whereas the function below returns 2.5:

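    import statistics

    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    # Cast to float: a Python int returned from a UDF declared as DoubleType comes back as null.
    median_udf = F.udf(lambda x: float(statistics.median(x)) if x else None, DoubleType())

    ... .groupBy('something').agg(median_udf(F.collect_list(F.col('value'))).alias('median'))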
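
The difference between the two results can be reproduced without Spark. The following is only a sketch: `lower_percentile` is a hypothetical helper, not Spark's implementation, and it assumes that percentile_approx returns an actual element of the data rather than interpolating, which is consistent with the result 2 quoted above.

```python
import math
import statistics


def lower_percentile(values, p):
    """Hypothetical helper: return the smallest element v of `values` such that
    at least a fraction p of the elements are <= v (no interpolation)."""
    ordered = sorted(values)
    # ceil(p * n) elements must be <= the result; convert that count to an index.
    index = max(math.ceil(p * len(ordered)) - 1, 0)
    return ordered[index]


values = [1, 2, 3, 4]
print(lower_percentile(values, 0.5))  # 2, an element of the data
print(statistics.median(values))      # 2.5, the mean of the two middle elements
```

For an odd number of values the two agree, so the discrepancy only shows up in groups with an even count.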
    
