Median / quantiles within PySpark groupBy


感情败类 2020-12-04 15:26

I would like to calculate group quantiles on a Spark dataframe (using PySpark). Either an approximate or exact result would be fine. I prefer a solution that I can use withi

5 answers

萌比男神i (original poster)

2020-12-04 16:17
A problem with `percentile_approx(val, 0.5)`: for values such as [1, 2, 3, 4] it returns 2 as the median, whereas the function below returns 2.5:
```
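import statistics

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Exact median per group. float() matters: a Python int returned under DoubleType becomes null.
median_udf = F.udf(lambda x: float(statistics.median(x)) if x else None, DoubleType())

... .groupBy('something').agg(median_udf(F.collect_list(F.col('value'))).alias('median'))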
```
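To see the difference between the two results without a Spark session, here is a minimal plain-Python sketch. The helper `group_medians` and the sample rows are made up for illustration. `statistics.median_low` is used only as a stand-in for the "picks an actual element" behaviour reported above for `percentile_approx`; it is not Spark's algorithm.

```python
import statistics


def group_medians(rows):
    """Group (key, value) pairs and compute two kinds of median per key.

    'interpolated' averages the two middle values when the count is even
    (what the UDF above does); 'lower' returns the lower middle element,
    mimicking the result reported for percentile_approx on [1, 2, 3, 4].
    """
    groups = {}
    for key, value in rows:
        groups.setdefault(key, []).append(value)
    return {
        key: {
            "interpolated": float(statistics.median(vals)),
            "lower": statistics.median_low(vals),
        }
        for key, vals in groups.items()
    }


# Hypothetical sample data: group "a" has an even count, group "b" an odd one.
rows = [("a", 1), ("a", 2), ("a", 3), ("a", 4), ("b", 10), ("b", 20), ("b", 30)]
result = group_medians(rows)
print(result)
```

The two values only differ when a group has an even number of elements. Note also that `collect_list` pulls every value of a group into a single Python list, so the UDF approach is probably only reasonable when groups are small. As far as I know, Spark SQL's exact `percentile` aggregate (e.g. via `F.expr('percentile(value, 0.5)')`) also interpolates, which may be worth trying before a UDF.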