Replace missing values with mean - Spark Dataframe

青春惊慌失措 2020-11-27 21:55

I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new t

3 answers

失恋的感觉 (original poster)

2020-11-27 22:29
For PySpark, this is the code I used:
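```
# Map every column name to the 'mean' aggregate.
mean_dict = {col: 'mean' for col in df.columns}
# Compute all means in one pass; the result keys look like 'avg(col1)'.
col_avgs = df.agg(mean_dict).collect()[0].asDict()
# Strip the 'avg(' prefix and ')' suffix (use .items(); .iteritems() is Python 2 only).
col_avgs = {k[4:-1]: v for k, v in col_avgs.items()}
df.fillna(col_avgs).show()
```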
The four steps are:
1. Create the dictionary mean_dict mapping column names to the aggregate operation (mean)
2. Calculate the mean for each column, and save it as the dictionary col_avgs
3. The column names in col_avgs start with avg( and end with ), e.g. avg(col1). Strip the avg( prefix and the closing ) so the keys match the original column names.
4. Fill the missing values in each column of the dataframe with that column's average using col_avgs
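The four steps above assume every column is numeric and has at least one non-null value. A minimal sketch of a more defensive variant follows, assuming `df` is a `pyspark.sql.DataFrame`. The helper names (`strip_agg_name`, `numeric_columns`, `mean_fill_values`) are hypothetical and not part of the original answer. It only averages columns whose `df.dtypes` entry looks numeric, and it skips columns whose mean comes back as `None`.

```python
# Hypothetical helpers (not from the original answer). `df` is assumed to be
# a pyspark.sql.DataFrame; only its own methods are used, so nothing from
# pyspark needs to be imported here.

# Spark type strings (as reported by df.dtypes) treated as numeric.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def strip_agg_name(name):
    """Turn an aggregate column name such as 'avg(col1)' into 'col1'."""
    if name.startswith("avg(") and name.endswith(")"):
        return name[4:-1]
    return name


def numeric_columns(dtypes):
    """Pick column names whose type string is numeric.

    `dtypes` is a list of (name, type) pairs, as returned by df.dtypes.
    Decimal types appear as e.g. 'decimal(10,2)', hence the prefix check.
    """
    return [
        name
        for name, dtype in dtypes
        if dtype in NUMERIC_TYPES or dtype.startswith("decimal")
    ]


def mean_fill_values(df):
    """Build a {column: mean} dict that can be passed to df.fillna()."""
    cols = numeric_columns(df.dtypes)
    if not cols:
        return {}
    row = df.agg({c: "mean" for c in cols}).collect()[0].asDict()
    # Skip columns whose mean is None (every value was null) and convert
    # the rest to float so Decimal means become plain numbers.
    return {
        strip_agg_name(k): float(v) for k, v in row.items() if v is not None
    }
```

With these helpers, the last step of the answer would become `df.fillna(mean_fill_values(df)).show()`. String columns and entirely empty columns are left untouched.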