Question
I have recently started learning Python to develop a predictive model for a research project using machine learning methods. I have a large dataset consisting of both numerical and categorical data, with many missing values. I am currently trying to encode the categorical features using OneHotEncoder. From what I read, my understanding was that for a missing value (NaN), OneHotEncoder would assign 0s to all of the feature's categories, like this:
0    Male
1    Female
2    NaN
After applying OneHotEncoder:
0    1 0
1    0 1
2    0 0
However, when running the following code:
# Encoding categorical data
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# obj_df is the dataset loaded earlier (not shown); column 1 is categorical
ct = ColumnTransformer([('encoder', OneHotEncoder(handle_unknown='ignore'), [1])],
                       remainder='passthrough')
obj_df = np.array(ct.fit_transform(obj_df))
print(obj_df)
I get the following error: ValueError: Input contains NaN
So I am guessing my previous understanding of how OneHotEncoder handles missing values is wrong. Is there a way for me to get the functionality described above? I know imputing the missing values before encoding will resolve this issue, but I am reluctant to do this as I am dealing with medical data and fear that imputation may decrease the predictive accuracy of my model.
I found this question that is similar but the answer doesn't offer a detailed enough solution on how to deal with the NaN values.
Let me know what your thoughts are, thanks.
Answer 1:
You will need to impute the missing values before encoding. You can define a Pipeline with an imputation step, using SimpleImputer with a most_frequent strategy for instance, prior to the one-hot encoding:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Fill NaN with the most frequent category, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, [0])
    ])
# Male and Female are tied here; on a tie SimpleImputer returns the smallest value ('Female')
df = pd.DataFrame(['Male', 'Female', np.nan])
preprocessor.fit_transform(df)
array([[0., 1.],
       [1., 0.],
       [1., 0.]])
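In the output above, the NaN row is encoded exactly like the Female row, so the fact that the value was missing is lost. If that matters, one possible variation is to fill NaN with a constant placeholder so that missingness becomes its own category. The following is only a minimal sketch under that assumption; the column name "sex" and the placeholder string "missing" are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy column mirroring the example in the question (hypothetical column name).
df = pd.DataFrame({"sex": ["Male", "Female", np.nan]}, dtype=object)

# Fill NaN with a constant placeholder, then one-hot encode.
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# OneHotEncoder returns a sparse matrix by default; convert it for display.
encoded = pipe.fit_transform(df).toarray()
categories = list(pipe.named_steps["encoder"].categories_[0])

print(categories)  # ['Female', 'Male', 'missing']
print(encoded)     # the NaN row has a 1 only in the 'missing' column
```

With this variant the model can still tell missing entries apart from real ones, at the cost of one extra column per feature.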
Answer 2:
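- Replace the NaN values with a placeholder category such as "Others".
- Then proceed with one-hot encoding.
- You can then remove the "Others" column, which leaves the rows that were NaN with 0s in every remaining category column.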
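The three steps above can be sketched roughly as follows. This is an illustration rather than a definitive implementation; the column name "sex" is a hypothetical choice, and the sketch assumes that "Others" does not already occur as a real category in the data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy column mirroring the example in the question (hypothetical column name).
df = pd.DataFrame({"sex": ["Male", "Female", np.nan]}, dtype=object)

# Step 1: replace NaN with a placeholder category.
filled = df[["sex"]].fillna("Others")

# Step 2: one-hot encode; the learned categories are sorted alphabetically.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(filled).toarray()
categories = list(encoder.categories_[0])  # ['Female', 'Male', 'Others']

# Step 3: drop the placeholder column so rows that were NaN become all zeros.
keep = [i for i, c in enumerate(categories) if c != "Others"]
result = encoded[:, keep]

print(result)
# [[0. 1.]
#  [1. 0.]
#  [0. 0.]]
```

The columns come out in the order Female, Male, so the third row reproduces the all-zeros encoding asked for in the question.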
Source: https://stackoverflow.com/questions/62409303/how-to-handle-missing-values-nan-in-categorical-data-when-using-scikit-learn-o