How to use sklearn Column Transformer?

Submitted by 隐身守侯 on 2019-12-18 11:24:41

Question


I'm trying to convert a categorical value (in my case the Country column) into encoded values using LabelEncoder and then OneHotEncoder, and I was able to convert the categorical values. But I'm getting a warning that the OneHotEncoder 'categorical_features' keyword is deprecated and to "use the ColumnTransformer instead". So how can I use ColumnTransformer to achieve the same result?

Below is my input data set and the code I tried.

Input Data set

Country Age Salary
France  44  72000
Spain   27  48000
Germany 30  54000
Spain   38  61000
Germany 40  67000
France  35  58000
Spain   26  52000
France  48  79000
Germany 50  83000
France  37  67000


import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# x is my dataset (a pandas DataFrame)

label_encoder = LabelEncoder()
x.iloc[:, 0] = label_encoder.fit_transform(x.iloc[:, 0])  # encode the Country column as integers
hot_encoder = OneHotEncoder(categorical_features=[0])     # 'categorical_features' is deprecated
x = hot_encoder.fit_transform(x).toarray()

And the output I'm getting is shown below. How can I get the same output with ColumnTransformer?

0(fran) 1(ger) 2(spain) 3(age)  4(salary)
1         0       0      44        72000
0         0       1      27        48000
0         1       0      30        54000
0         0       1      38        61000
0         1       0      40        67000
1         0       0      35        58000
0         0       1      26        52000
1         0       0      48        79000
0         1       0      50        83000
1         0       0      37        67000

I tried the following code:

from sklearn.compose import ColumnTransformer, make_column_transformer

preprocess = make_column_transformer(
    ([0], OneHotEncoder())
)
x = preprocess.fit_transform(x).toarray()

I was able to encode the Country column with the above code, but the Age and Salary columns are missing from the x variable after transforming.


Answer 1:


I think the poster is not trying to transform the Age and Salary columns. From the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html), ColumnTransformer (and make_column_transformer) only transforms the columns specified in the transformers (i.e., [0] in your example). You should set remainder="passthrough" to keep the rest of the columns. In other words:

preprocessor = make_column_transformer( (OneHotEncoder(),[0]),remainder="passthrough")
x = preprocessor.fit_transform(x)
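For reference, here is a minimal, self-contained sketch of that call on a few rows of the sample data from the question; note that ColumnTransformer places the transformed columns first, so the one-hot country columns come before Age and Salary:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

# a few rows of the sample data from the question
x = pd.DataFrame({'Country': ['France', 'Spain', 'Germany'],
                  'Age': [44, 27, 30],
                  'Salary': [72000, 48000, 54000]})

preprocessor = make_column_transformer(
    (OneHotEncoder(), [0]),       # one-hot encode column 0 (Country)
    remainder="passthrough")      # keep Age and Salary untouched

result = preprocessor.fit_transform(x)
print(result.shape)  # (3, 5): three country dummies plus Age and Salary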



Answer 2:


It is strange that you want to encode continuous data such as Salary. It makes no sense unless you have binned your salary into ranges/categories. If I were you I would do:

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder



numeric_features = ['Salary']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['Age','Country']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

From here you can pipe it into a classifier, e.g.:

from sklearn.linear_model import LogisticRegression

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

Use it like so:

clf.fit(X_train,y_train)

This will apply the preprocessor and then pass the transformed data to the predictor.
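A minimal end-to-end sketch of using this pipeline, continuing from the code above (it reuses the imports, Pipeline, and preprocessor already defined). The question's dataset has no label column, so the target y below is made up purely for illustration:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# hypothetical data: the question's dataset has no target column
X = pd.DataFrame({'Country': ['France', 'Spain', 'Germany', 'Spain'],
                  'Age': [44, 27, 30, 38],
                  'Salary': [72000, 48000, 54000, 61000]})
y = [1, 0, 0, 1]  # made-up labels for illustration only

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf.fit(X_train, y_train)     # fits the preprocessor and classifier together
print(clf.predict(X_test))    # the same preprocessing is applied before predicting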




Answer 3:


@Fawwaz Yusran To tackle this warning...

FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values. If you want the future behaviour and silence this warning, you can specify "categories='auto'". In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly. warnings.warn(msg, FutureWarning)

Remove the following...

labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])

Since you are using OneHotEncoder directly, you don't need LabelEncoder.
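In other words, the raw string column can go straight into OneHotEncoder via ColumnTransformer. A minimal sketch, assuming x is the original DataFrame with the 'France'/'Spain'/'Germany' strings still in column 0:

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# no LabelEncoder needed: OneHotEncoder accepts string categories directly
preprocess = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(), [0])],
    remainder='passthrough')
x_encoded = preprocess.fit_transform(x)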




Answer 4:


from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

preprocess = make_column_transformer(
    (OneHotEncoder(categories='auto'), [0]),
    remainder="passthrough")
X = preprocess.fit_transform(X)

I fixed the same issue using the above code.
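As a side note (not part of the original answer): if you also want readable names for the output columns, recent scikit-learn versions (1.0 and later) expose get_feature_names_out on the fitted ColumnTransformer:

# after the fit_transform call above, the fitted transformer can report its
# output column names; the exact names vary with the scikit-learn version
# and whether X is a DataFrame or an array
print(preprocess.get_feature_names_out())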




Answer 5:


Since you are transforming only the Country column (i.e., [0] in your example), use remainder="passthrough" to keep the remaining columns, so they come through unchanged.

try:

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

labelencoder = LabelEncoder()
x[:, 0] = labelencoder.fit_transform(x[:, 0])  # encode Country as integers first
preprocess = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(), [0])],  # one-hot encode the Country column
    remainder="passthrough")                          # keep Age and Salary as-is
x = np.array(preprocess.fit_transform(x), dtype=int)



Answer 6:


The simplest method is to use pandas get_dummies on your CSV data frame:

dataset = pd.read_csv("yourfile.csv")
dataset = pd.get_dummies(dataset,columns=['Country'])

Done. Your dataset will then contain one dummy column for each country.
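For example, on the sample data from the question (a minimal sketch; the DataFrame is built inline instead of reading a CSV):

import pandas as pd

dataset = pd.DataFrame({'Country': ['France', 'Spain', 'Germany'],
                        'Age': [44, 27, 30],
                        'Salary': [72000, 48000, 54000]})
dataset = pd.get_dummies(dataset, columns=['Country'])
print(dataset.columns.tolist())
# ['Age', 'Salary', 'Country_France', 'Country_Germany', 'Country_Spain']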



Source: https://stackoverflow.com/questions/54160370/how-to-use-sklearn-column-transformer
