Replace missing values with mean - Spark Dataframe

别来无恙 submitted on 2019-11-27 09:05:09
user6910411

Spark >= 2.2

You can use org.apache.spark.ml.feature.Imputer (which supports both the mean and the median strategies).

Scala:

import org.apache.spark.ml.feature.Imputer

val imputer = new Imputer()
  .setInputCols(df.columns)
  .setOutputCols(df.columns.map(c => s"${c}_imputed"))
  .setStrategy("mean")

imputer.fit(df).transform(df)

Python:

from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=df.columns, 
    outputCols=["{}_imputed".format(c) for c in df.columns]
)
imputer.fit(df).transform(df)

Spark < 2.2

Here you are:

import org.apache.spark.sql.functions.mean

df.na.fill(df.columns.zip(
  df.select(df.columns.map(mean(_)): _*).first.toSeq
).toMap)

where

df.columns.map(mean(_)): Array[Column] 

computes an average for each column,

df.select(_: _*).first.toSeq: Seq[Any]

collects aggregated values and converts row to Seq[Any] (I know it is suboptimal but this is the API we have to work with),

df.columns.zip(_).toMap: Map[String,Any] 

creates a Map[String, Any] which maps each column name to its average, and finally:

df.na.fill(_): DataFrame

fills the missing values using:

fill: Map[String, Any] => DataFrame 

from DataFrameNaFunctions.
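
To make the data flow concrete, here is a minimal plain-Python sketch of the same three steps: average each column, zip the names with the averages into a map, then fill the gaps. It does not use Spark. The rows list and the fill_with_means helper are hypothetical stand-ins for a DataFrame and for na.fill, and missing values are represented as None.

```python
from statistics import mean

def fill_with_means(columns, rows):
    """Replace None entries with the mean of the non-missing values in that column."""
    # Step 1: one average per column, skipping missing values (Spark's mean also skips nulls)
    averages = [
        mean(row[i] for row in rows if row[i] is not None)
        for i in range(len(columns))
    ]
    # Step 2: pair names with averages, analogous to df.columns.zip(...).toMap
    fill_map = dict(zip(columns, averages))
    # Step 3: fill the gaps, analogous to df.na.fill(fill_map)
    filled = [
        tuple(fill_map[c] if v is None else v for c, v in zip(columns, row))
        for row in rows
    ]
    return fill_map, filled

# Hypothetical two-column data set with one gap in each column
columns = ["x", "y"]
rows = [(1.0, None), (None, 4.0), (3.0, 8.0)]
fill_map, filled = fill_with_means(columns, rows)
```

In this toy version a column with no values at all would raise an error in step 1, so a real pipeline would need to handle that case separately.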

To ignore NaN entries you can replace:

df.select(df.columns.map(mean(_)): _*).first.toSeq

with:

import org.apache.spark.sql.functions.{col, isnan, when}


df.select(df.columns.map(
  c => mean(when(!isnan(col(c)), col(c)))
): _*).first.toSeq
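
The reason this matters: NaN is a regular floating-point value rather than a null, so it takes part in the average and turns the result into NaN. The plain-Python sketch below shows the same effect outside Spark; the values list is made up for illustration.

```python
import math

# Hypothetical column values containing one NaN
values = [1.0, float("nan"), 3.0]

# A single NaN poisons a naive average
naive = sum(values) / len(values)

# Dropping NaN entries first mirrors mean(when(!isnan(col(c)), col(c)))
clean = [v for v in values if not math.isnan(v)]
nan_safe = sum(clean) / len(clean)
```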

For imputing the median (instead of the mean) in PySpark < 2.2:

# Select the numeric columns
num_cols = [name for name, dtype in df.dtypes if dtype in {"bigint", "double", "int"}]
# Build a dict mapping each column name to its approximate median
median_dict = dict()
for c in num_cols:
    median_dict[c] = df.stat.approxQuantile(c, [0.5], 0.001)[0]

Then, apply na.fill

df_imputed = df.na.fill(median_dict)
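
The numeric-column filter above relies on df.dtypes being a list of (column name, type string) pairs. A small sketch with a hand-written list in that shape (the column names are hypothetical) shows which columns pass. Note that types outside the set, such as "float" here, are skipped, so you may want to extend the set to match your schema.

```python
# Hand-written stand-in for df.dtypes: (column name, type string) pairs
dtypes = [
    ("id", "bigint"),
    ("name", "string"),
    ("score", "double"),
    ("ratio", "float"),
]

# Same type set as in the answer above
numeric_types = {"bigint", "double", "int"}
num_cols = [name for name, dtype in dtypes if dtype in numeric_types]
```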

For PySpark, this is the code I used:

mean_dict = {c: 'mean' for c in df.columns}
col_avgs = df.agg(mean_dict).collect()[0].asDict()
# Strip the avg( ... ) wrapper from each key
col_avgs = {k[4:-1]: v for k, v in col_avgs.items()}
df.fillna(col_avgs).show()

The four steps are:

  1. Create the dictionary mean_dict mapping column names to the aggregate operation (mean)
  2. Calculate the mean for each column, and save it as the dictionary col_avgs
  3. The column names in col_avgs start with avg( and end with ), e.g. avg(col1). Strip the avg( prefix and the closing parenthesis so that only the column name remains.
  4. Fill the columns of the dataframe with the averages using col_avgs
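
Step 3 is plain string slicing: k[4:-1] drops the four-character avg( prefix and the trailing ). A quick sketch with a hand-written dictionary (the column names and values are made up) shows the effect:

```python
# Hand-written stand-in for df.agg(mean_dict).collect()[0].asDict()
col_avgs = {"avg(col1)": 2.5, "avg(col2)": 10.0}

# Drop the leading "avg(" (4 characters) and the trailing ")"
col_avgs = {k[4:-1]: v for k, v in col_avgs.items()}
```

After this step the keys match the DataFrame's column names, which is what fillna expects.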