I\'m trying to use scikit-learn\'s LabelEncoder
to encode a pandas DataFrame
of string labels. As the dataframe has many (50+) columns, I want to a
Since scikit-learn 0.20 you can use sklearn.compose.ColumnTransformer
and sklearn.preprocessing.OneHotEncoder
:
If you only have categorical variables, OneHotEncoder
directly:
from sklearn.preprocessing import OneHotEncoder
OneHotEncoder(handle_unknown='ignore').fit_transform(df)
If you have heterogeneously typed features:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import OneHotEncoder
categorical_columns = ['pets', 'owner', 'location']
numerical_columns = ['age', 'weigth', 'height']
column_trans = make_column_transformer(
(categorical_columns, OneHotEncoder(handle_unknown='ignore'),
(numerical_columns, RobustScaler())
column_trans.fit_transform(df)
More options in the documentation: http://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data