Replace missing values with mean - Spark Dataframe


青春惊慌失措 2020-11-27 21:55

I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new t

3 Answers

余生分开走 (OP)

2020-11-27 22:10

To impute the median (instead of the mean) in PySpark < 2.2:

# Keep only the numeric columns
num_cols = [name for name, dtype in df.dtypes if dtype in {"bigint", "double", "int"}]
# Build a dict mapping each numeric column to its approximate median
median_dict = dict()
for c in num_cols:
    # approxQuantile(col, probabilities, relativeError); 0.5 is the median
    median_dict[c] = df.stat.approxQuantile(c, [0.5], 0.001)[0]

Then apply na.fill:

df_imputed = df.na.fill(median_dict)
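
Since the question asks about the mean rather than the median, the same pattern can probably be adapted by swapping the quantile call for an average. The following is only a sketch: numeric_columns and mean_fill_values are hypothetical helper names, not part of PySpark, and df is assumed to be an existing Spark DataFrame.

```python
# Sketch only: numeric_columns and mean_fill_values are hypothetical helper names.
NUMERIC_TYPES = {"bigint", "double", "int"}  # same type names as in the answer


def numeric_columns(dtypes):
    """Return the names of columns whose Spark type string is numeric."""
    return [name for name, dtype in dtypes if dtype in NUMERIC_TYPES]


def mean_fill_values(df):
    """Map each numeric column of a Spark DataFrame to its mean."""
    fill = {}
    for c in numeric_columns(df.dtypes):
        # agg({c: "avg"}) gives a one-row, one-column result; avg skips nulls.
        mean = df.agg({c: "avg"}).first()[0]
        if mean is not None:  # an all-null column has no mean, so leave it alone
            fill[c] = mean
    return fill


# Usage, assuming `df` is an existing Spark DataFrame:
# df_imputed = df.na.fill(mean_fill_values(df))
```

Like the median loop, this runs one aggregation per column, which keeps it simple but may be slow on wide tables.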
