In PySpark, I have a variable-length array of doubles for which I would like to find the mean. However, the average function requires a single numeric type.
Is there a way to find the average of an array column without exploding the array out? Exploding seems unnecessarily expensive for my data.
In your case, your options are to use explode or a udf. As you've noted, explode is unnecessarily expensive. Thus, a udf is the way to go.
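For comparison, the explode route would look roughly like the sketch below. It blows the array out into one row per element and then groups back up to average, which forces a shuffle; the "id" column used to group back on is an assumption, not something from your question.

from pyspark.sql import functions as F

# Explode the array into one row per element, then re-aggregate.
# The groupBy triggers a shuffle, which is why this is comparatively expensive.
exploded = df.withColumn("lon", F.explode("longitude"))
exploded.groupBy("id").agg(F.avg("lon").alias("avg")).show()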
You can write your own function to take the mean of a list of numbers, or just piggyback off of numpy.mean. If you use numpy.mean, you'll have to cast the result to a float (because Spark doesn't know how to handle numpy.float64).
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# Wrap numpy.mean in a udf, casting the numpy.float64 result to a Python float
array_mean = udf(lambda x: float(np.mean(x)), FloatType())
df.select(array_mean("longitude").alias("avg")).show()
#+---------+
#| avg|
#+---------+
#| -81.9|
#|-82.93166|
#| -82.93|
#+---------+
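If you'd rather avoid the numpy dependency, the "write your own function" option works the same way. Here's a small sketch; returning None for empty arrays is an assumption about how you want missing data handled, not something from the question.

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# Plain-Python mean: sum/len already yields a Python float, so no numpy cast is needed.
# Empty arrays map to None here (an assumption).
array_mean_py = udf(lambda x: float(sum(x)) / len(x) if x else None, FloatType())
df.select(array_mean_py("longitude").alias("avg")).show()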