Question
I want to use KNN for imputing categorical features in a sklearn pipeline (multiple categorical features are missing).
I have done quite a bit of research on existing KNN solutions (fancyimpute, sklearn's KNeighborsRegressor). None of them seem to:
- work in a sklearn pipeline
- impute categorical features
Some of my questions are (any advice is highly appreciated):
- Is there any existing approach that allows KNN (or any other regressor) to impute missing values (categorical in this case) and works within a sklearn pipeline?
- The fancyimpute KNN implementation does not seem to use Hamming distance for imputing missing values (which would be ideal for categorical features).
- Is there any fast KNN implementation available, considering KNN is time-consuming when imputing missing values (i.e., it runs predictions for the missing values against the whole dataset)?
Answer 1:
The default KNeighborsRegressor can be used to regress missing values, but with numeric values only. Therefore, for categorical features I believe you most likely need to encode them first, then impute the missing values.
KNNImputer most likely fills values using the mean/mode of the neighbours.
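For illustration, a minimal sketch of the encode-then-impute idea: ordinal-encode the categoricals, then let KNNImputer fill the gaps, all inside one Pipeline. The DataFrame and column names are made up, and it assumes scikit-learn >= 1.1 so OrdinalEncoder can keep missing entries as NaN via encoded_missing_value:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import KNNImputer

# Toy data for illustration only
X = pd.DataFrame({
    "color": ["red", "blue", np.nan, "red", "blue"],
    "size":  ["S", np.nan, "M", "L", "M"],
})

pipe = Pipeline([
    # Encode categories as numeric codes; missing entries stay NaN
    ("encode", OrdinalEncoder(handle_unknown="use_encoded_value",
                              unknown_value=-1,
                              encoded_missing_value=np.nan)),
    # Fill the NaNs from the 2 nearest rows
    ("impute", KNNImputer(n_neighbors=2)),
])

X_imputed = pipe.fit_transform(X)
print(X_imputed)
```

Note that KNNImputer averages the neighbours' codes, so imputed values can be fractional; you would need to round (or wrap the imputer in a small transformer) before mapping codes back to the original categories.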
IterativeImputer from sklearn can run the imputation against the whole dataset.
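For example (a sketch with made-up toy data, not the asker's pipeline): IterativeImputer is still experimental, so the enable import is required, and any sklearn regressor, e.g. KNeighborsRegressor, can be plugged in as the estimator:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (required while experimental)
from sklearn.impute import IterativeImputer
from sklearn.neighbors import KNeighborsRegressor

# Toy numeric data for illustration only
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [np.nan, 5.0, 9.0],
    [4.0, 8.0, 12.0],
])

# Each feature with missing values is modelled as a function of the others,
# here with a KNN regressor instead of the default BayesianRidge
imputer = IterativeImputer(estimator=KNeighborsRegressor(n_neighbors=2),
                           max_iter=10, random_state=0)
print(imputer.fit_transform(X))
```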
Answer 2:
KNNImputer is new as of sklearn version 0.22.0
KNNImputer uses a NaN-aware Euclidean distance metric (nan_euclidean) by default, but you can pass in your own custom distance metric.
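As an illustration, here is a sketch of what a custom NaN-aware, Hamming-style metric could look like: KNNImputer's metric parameter accepts a callable that receives two rows plus a missing_values keyword. The nan_hamming name and toy data are made up, and it assumes the categorical features have already been encoded to numeric codes, since KNNImputer works on float arrays:

```python
import numpy as np
from sklearn.impute import KNNImputer

def nan_hamming(x, y, *, missing_values=np.nan, **kwargs):
    # Compare only positions observed in both rows;
    # distance = fraction of mismatching categories
    mask = ~(np.isnan(x) | np.isnan(y))
    if not mask.any():
        return np.nan  # no feature observed in both rows
    return float(np.mean(x[mask] != y[mask]))

# Toy, already ordinal-encoded data for illustration only
X = np.array([
    [0.0, 1.0, np.nan],
    [0.0, 1.0, 2.0],
    [1.0, np.nan, 0.0],
    [0.0, 1.0, 2.0],
])

imputer = KNNImputer(n_neighbors=2, metric=nan_hamming)
print(imputer.fit_transform(X))
```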
I can't speak to the speed of KNNImputer, but I'd imagine there have been some optimizations done on it if it's made it into sklearn.
Source: https://stackoverflow.com/questions/57775064/how-to-implement-knn-to-impute-categorical-features-in-a-sklearn-pipeline