Cyclical Loop Between OneHotEncoder and KNNImpute in Scikit-learn

Submitted by 你离开我真会死。 on 2021-01-04 05:49:44

Question


I'm working with a really simple dataset. It has some missing values, in both categorical and numeric features. Because of this, I'm trying to use sklearn.impute.KNNImputer to get the most accurate imputation I can. However, when I run the following code:

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=120)
imputer.fit_transform(x_train)

I get the error: ValueError: could not convert string to float: 'Private'

That makes sense, since it obviously can't handle categorical data. But when I try to run OneHotEncoder with:

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(drop="first")
encoder.fit_transform(x_train[categorical_features])

It throws the error: ValueError: Input contains NaN

I'd prefer to use KNNImputer even with the categorical data, as I feel I'd lose some accuracy if I just used a ColumnTransformer and imputed the numeric and categorical data separately. Is there any way to get OneHotEncoder to ignore these missing values? If not, is using ColumnTransformer or a simpler imputer a better way of tackling this problem?

Thanks in advance


Answer 1:


There are open issues/PRs to handle missing values on OneHotEncoder, but it's not clear yet what the options would be. In the interim, here's a manual approach.

  • Fill the missing categorical values with the string "missing", using pandas or SimpleImputer.
  • Use OneHotEncoder then.
  • Use the one-hot encoder's get_feature_names (renamed get_feature_names_out in scikit-learn 1.0 and later) to identify the columns corresponding to each original feature, and in particular the "missing" indicator.
  • For each row and each original categorical feature, when the 1 is in the "missing" column, replace the 0's with np.nan; then delete the missing indicator column.
  • Now everything should be set up to run KNNImputer.
  • Finally, if desired, postprocess the imputed categorical-encoding columns. (Simply rounding might get you an all-zeros row for a categorical feature, but I don't think with KNNImputer you could get more than one 1 in a row. You could argmax instead to get back exactly one 1.)
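
The steps above can be sketched roughly as follows. This is only an illustration under several assumptions: the toy DataFrame, the column names age and workclass, and n_neighbors=2 are all made up for the example. It also locates each feature's columns through the encoder's categories_ attribute rather than get_feature_names, and it skips drop="first" so that every real category keeps its own column for the argmax step.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one numeric and one categorical column, each with a gap.
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 47.0, 51.0, 38.0],
    "workclass": ["Private", "State-gov", "Private", np.nan, "Self-emp", "Private"],
})
num_cols, cat_cols = ["age"], ["workclass"]

# 1. Fill categorical gaps with the placeholder string "missing".
filled = df[cat_cols].fillna("missing").to_numpy(dtype=object)

# 2. One-hot encode (no drop, so every real category keeps its own column).
encoder = OneHotEncoder()
encoded = encoder.fit_transform(filled).toarray()

# 3-4. Locate each feature's block of columns, set rows flagged as "missing"
#      to NaN across the block, and keep every column except the indicator.
kept_blocks, kept_labels = [], {}
start = 0
for col, cats in zip(cat_cols, encoder.categories_):
    cats = list(cats)
    stop = start + len(cats)
    if "missing" in cats:
        flagged = encoded[:, start + cats.index("missing")] == 1
        encoded[flagged, start:stop] = np.nan
    keep = [start + i for i, c in enumerate(cats) if c != "missing"]
    kept_blocks.append(encoded[:, keep])
    kept_labels[col] = [c for c in cats if c != "missing"]
    start = stop

# 5. Stack numeric and one-hot columns, then impute everything together.
X = np.hstack([df[num_cols].to_numpy(dtype=float)] + kept_blocks)
imputed = KNNImputer(n_neighbors=2).fit_transform(X)

# 6. Collapse each one-hot block back to a single label with argmax.
decoded, offset = {}, len(num_cols)
for col, labels in kept_labels.items():
    block = imputed[:, offset:offset + len(labels)]
    decoded[col] = np.array(labels, dtype=object)[block.argmax(axis=1)]
    offset += len(labels)

print(decoded["workclass"])
```

Note that argmax breaks ties by taking the first category in sorted order, and that unscaled numeric columns may dominate the neighbor distances, so scaling them before imputation is probably worth considering.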


Source: https://stackoverflow.com/questions/62868129/cyclical-loop-between-onehotencoder-and-knnimpute-in-scikit-learn
