Replace missing values with mean - Spark Dataframe

青春惊慌失措 2020-11-27 21:55

I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new t

3 answers
  •  余生分开走
    2020-11-27 22:10

    To impute the median (instead of the mean) in PySpark < 2.2:

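    # Select the numeric columns by their Spark type name
    num_cols = [name for name, dtype in df.dtypes if dtype in {"bigint", "double", "int"}]
    # Build a dict mapping each numeric column to its approximate median
    median_dict = dict()
    for c in num_cols:
        # approxQuantile(col, probabilities, relativeError) returns a list; 0.5 is the median
        median_dict[c] = df.stat.approxQuantile(c, [0.5], 0.001)[0]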
    

    Then, apply na.fill

    df_imputed = df.na.fill(median_dict)
    
