I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new t
For PySpark, this is the code I used:
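# Map every column name to the 'mean' aggregate.
mean_dict = {col: 'mean' for col in df.columns}
# Compute all the means in one pass; the keys come back as 'avg(col)'.
col_avgs = df.agg(mean_dict).collect()[0].asDict()
# Strip the 'avg(' prefix and ')' suffix; .items() works on Python 2 and 3.
col_avgs = {k[4:-1]: v for k, v in col_avgs.items()}
# Replace the missing values in each column with that column's mean.
df.fillna(col_avgs).show()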
The four steps are:
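1. Create the dictionary mean_dict, mapping column names to the aggregate operation (mean).
2. Compute the mean of each column and store the result as the dictionary col_avgs.
3. The keys in col_avgs start with avg( and end with ), e.g. avg(col1). Strip the avg( prefix and the closing parenthesis so the keys match the column names.
4. Fill the missing values in each column of the DataFrame with the averages in col_avgs.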
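The key-renaming step can be made a little more defensive. Below is a minimal sketch in plain Python (no Spark needed to try it). The helper name clean_avg_keys and the sample dictionary are made up for illustration, and it assumes the aggregate may return None for a column that has no numeric values, which is not a valid replacement value for fillna.

```python
def clean_avg_keys(raw_avgs):
    """Turn {'avg(col1)': 2.0, ...} into {'col1': 2.0, ...}, dropping None means."""
    cleaned = {}
    for key, value in raw_avgs.items():
        # Only strip when the key really has the 'avg(...)' shape.
        if key.startswith("avg(") and key.endswith(")"):
            key = key[4:-1]
        # Skip columns whose mean is None (assumed: non-numeric or all-null).
        if value is not None:
            cleaned[key] = value
    return cleaned


# Hypothetical result of df.agg(mean_dict).collect()[0].asDict()
raw = {"avg(col1)": 2.0, "avg(col2)": None, "avg(col3)": 4.5}
col_avgs = clean_avg_keys(raw)
print(col_avgs)
```

The resulting dictionary can then be passed to df.fillna as before. Another option, if you would rather avoid the renaming altogether, is to alias each aggregate to its column name, for example with avg(c).alias(c) from pyspark.sql.functions.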