How can I standardize only numeric variables in an sklearn pipeline?

Asked by 后悔当初, 2020-12-06 17:21

I am trying to create an sklearn pipeline with 2 steps:

  1. Standardize the data
  2. Fit the data using KNN

However, my data has both numeric and categorical features, and I only want to standardize the numeric columns.

3 Answers
  • 2020-12-06 17:26

    Since you have already converted your categorical features into dummies using pd.get_dummies, you don't need OneHotEncoder. Your pipeline can then be:

    from sklearn.preprocessing import StandardScaler, FunctionTransformer
    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.neighbors import KNeighborsClassifier
    
    knn = KNeighborsClassifier()
    
    pipeline = Pipeline(steps=[
        ('feature_processing', FeatureUnion(transformer_list=[
            # categorical: pass the dummy-encoded columns through unchanged
            ('categorical', FunctionTransformer(lambda data: data[:, cat_indices])),
    
            # numeric: select the numeric columns, then standardize them
            ('numeric', Pipeline(steps=[
                ('select', FunctionTransformer(lambda data: data[:, num_indices])),
                ('scale', StandardScaler())
            ]))
        ])),
        ('clf', knn)
    ])
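
    The lambdas above assume precomputed positional index lists `cat_indices` and `num_indices`. A minimal sketch of how those could be built from a dummy-encoded frame (the toy data and column names here are my own assumptions, not from the question):

    ```python
    import pandas as pd

    # Hypothetical toy frame; the answer assumes pd.get_dummies was already applied.
    df = pd.DataFrame({"age": [25, 32, 47], "city": ["NY", "SF", "NY"]})
    df = pd.get_dummies(df, columns=["city"])  # columns: age, city_NY, city_SF

    # Positional indices used by the FunctionTransformer lambdas above:
    num_indices = [df.columns.get_loc(c) for c in ["age"]]
    cat_indices = [df.columns.get_loc(c) for c in df.columns if c.startswith("city_")]

    X = df.to_numpy()  # the pipeline indexes with data[:, indices], so pass an array
    ```
    
    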
    
  • 2020-12-06 17:30

    I would use FeatureUnion. I usually do something like the following, assuming you dummy-encode your categorical variables within the pipeline rather than beforehand with Pandas:

    from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.neighbors import KNeighborsClassifier
    
    class Columns(BaseEstimator, TransformerMixin):
        def __init__(self, names=None):
            self.names = names
    
        def fit(self, X, y=None, **fit_params):
            return self
    
        def transform(self, X):
            return X[self.names]
    
    numeric = [list of numeric column names]
    categorical = [list of categorical column names]
    
    pipe = Pipeline([
        ("features", FeatureUnion([
            ('numeric', make_pipeline(Columns(names=numeric),StandardScaler())),
            ('categorical', make_pipeline(Columns(names=categorical),OneHotEncoder(sparse=False)))
        ])),
        ('model', KNeighborsClassifier())
    ])
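
    A quick self-contained smoke test of this pattern on a hypothetical toy frame (the column names, data, and `n_neighbors=1` setting are my own assumptions):

    ```python
    import pandas as pd
    from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.neighbors import KNeighborsClassifier

    class Columns(BaseEstimator, TransformerMixin):
        """Select a subset of DataFrame columns by name."""
        def __init__(self, names=None):
            self.names = names

        def fit(self, X, y=None, **fit_params):
            return self

        def transform(self, X):
            return X[self.names]

    df = pd.DataFrame({
        "height": [150.0, 160.0, 170.0, 180.0],
        "color":  ["red", "blue", "red", "blue"],
    })
    y = [0, 0, 1, 1]

    pipe = Pipeline([
        ("features", FeatureUnion([
            ("numeric", make_pipeline(Columns(names=["height"]), StandardScaler())),
            ("categorical", make_pipeline(Columns(names=["color"]), OneHotEncoder())),
        ])),
        ("model", KNeighborsClassifier(n_neighbors=1)),
    ])

    pipe.fit(df, y)
    preds = pipe.predict(df)  # with n_neighbors=1, training points predict their own labels
    ```
    
    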
    

    You could also check out Sklearn Pandas, which is worth a look for this kind of column-wise preprocessing.

  • 2020-12-06 17:47

    Assuming you have the following DF:

    In [163]: df
    Out[163]:
         a     b    c    d
    0  aaa  1.01  xxx  111
    1  bbb  2.02  yyy  222
    2  ccc  3.03  zzz  333
    
    In [164]: df.dtypes
    Out[164]:
    a     object
    b    float64
    c     object
    d      int64
    dtype: object
    

    you can find all numeric columns:

    In [165]: num_cols = df.columns[df.dtypes.apply(lambda c: np.issubdtype(c, np.number))]
    
    In [166]: num_cols
    Out[166]: Index(['b', 'd'], dtype='object')
    
    In [167]: df[num_cols]
    Out[167]:
          b    d
    0  1.01  111
    1  2.02  222
    2  3.03  333
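
    An equivalent, arguably simpler way to pick the numeric columns is pandas' built-in `select_dtypes` (a sketch on the same toy frame as above):

    ```python
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "a": ["aaa", "bbb", "ccc"],
        "b": [1.01, 2.02, 3.03],
        "c": ["xxx", "yyy", "zzz"],
        "d": [111, 222, 333],
    })

    # Same result as the dtype check above, via the built-in selector:
    num_cols = df.select_dtypes(include=np.number).columns
    ```
    
    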
    

    and apply StandardScaler only to those numeric columns:

    In [168]: scaler = StandardScaler()
    
    In [169]: df[num_cols] = scaler.fit_transform(df[num_cols])
    
    In [170]: df
    Out[170]:
         a         b    c         d
    0  aaa -1.224745  xxx -1.224745
    1  bbb  0.000000  yyy  0.000000
    2  ccc  1.224745  zzz  1.224745
    

    Now you can one-hot encode the categorical (non-numeric) columns...
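
    That last step could be done with pd.get_dummies on the remaining object columns (a sketch; the column names match the toy frame above):

    ```python
    import pandas as pd

    df = pd.DataFrame({
        "a": ["aaa", "bbb", "ccc"],
        "c": ["xxx", "yyy", "zzz"],
    })

    # One-hot encode the non-numeric columns, dropping the originals:
    encoded = pd.get_dummies(df, columns=["a", "c"])
    ```
    
    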
