Replace missing values with mean - Spark Dataframe

青春惊慌失措 2020-11-27 21:55

I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new t

3 answers
  •  失恋的感觉
    2020-11-27 22:29

    For PySpark, this is the code I used:

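    # Map each column name to the 'mean' aggregate.
    mean_dict = { col: 'mean' for col in df.columns }
    # Compute every column mean in one pass; keys look like 'avg(col1)'.
    col_avgs = df.agg( mean_dict ).collect()[0].asDict()
    # Strip the 'avg(' prefix and ')' suffix (use .iteritems() on Python 2).
    col_avgs = { k[4:-1]: v for k, v in col_avgs.items() }
    df.fillna( col_avgs ).show()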
    

    The four steps are:

    1. Create the dictionary mean_dict mapping column names to the aggregate operation (mean)
    2. Calculate the mean for each column, and save it as the dictionary col_avgs
    3. The column names in col_avgs start with avg( and end with ), e.g. avg(col1). Strip the avg( prefix and the closing ) so the keys match the original column names.
    4. Fill the missing values in each column with that column's average using col_avgs
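
Step 3 is the easiest one to get wrong, so here is a minimal sketch of it that runs without Spark. The plain dictionary is a hypothetical stand-in for the result of `df.agg(mean_dict).collect()[0].asDict()`, and the helper name `strip_avg_wrapper` and the sample values are illustrative assumptions, not part of the answer above. The sketch also drops entries whose mean is None, which is what Spark returns for a column with no non-null values, on the assumption that there is nothing useful to fill such a column with.

```python
def strip_avg_wrapper(col_avgs):
    """Map keys like 'avg(col1)' to 'col1', skipping columns with no mean."""
    cleaned = {}
    for name, value in col_avgs.items():
        # Only unwrap keys that really have the avg(...) shape.
        if name.startswith("avg(") and name.endswith(")"):
            name = name[4:-1]
        # A mean of None (e.g. an all-null column) gives nothing to fill with.
        if value is not None:
            cleaned[name] = value
    return cleaned


# Hypothetical stand-in for df.agg(mean_dict).collect()[0].asDict()
sample = {"avg(col1)": 2.0, "avg(col2)": None, "avg(col3)": 4.5}
fill_values = strip_avg_wrapper(sample)
print(fill_values)  # {'col1': 2.0, 'col3': 4.5}
```

A dictionary shaped like `fill_values` is what step 4 passes to `df.fillna(...)`; columns that are absent from it are simply left unchanged.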
