Pyspark: Add the average as a new column to DataFrame

混江龙づ霸主 提交于 2020-06-28 05:33:08

问题


I am computing mean of a column in data-frame but it resulted in all the values zeros. Can someone help me in why this is happening? Following is the code and table before and after the transformation of a column.

Before computing mean and adding "mean" column

result.select("dis_price_released").show(10)
 +------------------+
 |dis_price_released|
 +------------------+
 |               0.0|
 |               4.0|
 |               4.0|
 |               4.0|
 |               1.0|
 |               4.0|
 |               4.0|
 |               0.0|
 |               4.0|
 |               0.0|
 +------------------+

After computing mean and adding mean column

w = Window().partitionBy("dis_price_released").rowsBetween(-sys.maxsize, sys.maxsize)
df2 = result.withColumn("mean", avg("dis_price_released").over(w))
df2.select("dis_price_released", "mean").show(10)

+------------------+----+
|dis_price_released|mean|
+------------------+----+
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
+------------------+----+

回答1:


You can compute the avg first for the whole column, then use lit() to add it as a variable to your DataFrame, there is no need for window functions:

from pyspark.sql.functions import lit

mean = df.groupBy().avg("dis_price_released").take(1)[0][0]
df.withColumn("test", lit(mean)).show()
 +------------------+----+
|dis_price_released|test|
+------------------+----+
|               0.0| 2.5|
|               4.0| 2.5|
|               4.0| 2.5|
|               4.0| 2.5|
|               1.0| 2.5|
|               4.0| 2.5|
|               4.0| 2.5|
|               0.0| 2.5|
|               4.0| 2.5|
|               0.0| 2.5|
+------------------+----+



回答2:


This is yet another way to solve the problem

df.withColumn("mean", lit(df.select(avg("dis_price_released").as("temp")).first().getAs("temp"))).show


来源:https://stackoverflow.com/questions/44382822/pyspark-add-the-average-as-a-new-column-to-dataframe

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!