How can I standardize only numeric variables in an sklearn pipeline?

Asked by 后悔当初, 2020-12-06 17:21

I am trying to create an sklearn pipeline with 2 steps:

  1. Standardize the data
  2. Fit the data using KNN

However, my data has both numeric and categorical features, and I only want to standardize the numeric columns.

3 Answers
  • 2020-12-06 17:26

    Since you have already converted your categorical features into dummies using pd.get_dummies, you don't need OneHotEncoder. Your pipeline can then be:

    from sklearn.preprocessing import StandardScaler, FunctionTransformer
    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.neighbors import KNeighborsClassifier
    
    knn = KNeighborsClassifier()
    
    pipeline = Pipeline(steps=[
        ('feature_processing', FeatureUnion(transformer_list=[
            # categorical: pass the dummy-encoded columns through unchanged
            ('categorical', FunctionTransformer(lambda data: data[:, cat_indices])),
    
            # numeric: select the numeric columns, then standardize them
            ('numeric', Pipeline(steps=[
                ('select', FunctionTransformer(lambda data: data[:, num_indices])),
                ('scale', StandardScaler())
            ]))
        ])),
        ('clf', knn)
    ])
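
    The lambdas above assume precomputed positional index lists `cat_indices` and `num_indices`. A minimal sketch of how those could be built from a dummy-encoded frame (the toy data and column names here are my own assumptions, not from the question):

    ```python
    import pandas as pd

    # Hypothetical toy frame; the answer assumes pd.get_dummies was already applied.
    df = pd.DataFrame({"age": [25, 32, 47], "city": ["NY", "SF", "NY"]})
    df = pd.get_dummies(df, columns=["city"])  # columns: age, city_NY, city_SF

    # Positional indices used by the FunctionTransformer lambdas above:
    num_indices = [df.columns.get_loc(c) for c in ["age"]]
    cat_indices = [df.columns.get_loc(c) for c in df.columns if c.startswith("city_")]

    X = df.to_numpy()  # the pipeline indexes with data[:, indices], so pass an array
    ```
    
    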
    
  • 2020-12-06 17:30

    I would use FeatureUnion. I usually do something like the following, assuming you dummy-encode your categorical variables within the pipeline rather than beforehand with Pandas:

    from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.neighbors import KNeighborsClassifier
    
    class Columns(BaseEstimator, TransformerMixin):
        def __init__(self, names=None):
            self.names = names
    
        def fit(self, X, y=None, **fit_params):
            return self
    
        def transform(self, X):
            return X[self.names]
    
    numeric = [list of numeric column names]
    categorical = [list of categorical column names]
    
    pipe = Pipeline([
        ("features", FeatureUnion([
            ('numeric', make_pipeline(Columns(names=numeric),StandardScaler())),
            ('categorical', make_pipeline(Columns(names=categorical),OneHotEncoder(sparse=False)))
        ])),
        ('model', KNeighborsClassifier())
    ])
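
    A quick self-contained smoke test of this pattern on a hypothetical toy frame (the column names, data, and `n_neighbors=1` setting are my own assumptions):

    ```python
    import pandas as pd
    from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.neighbors import KNeighborsClassifier

    class Columns(BaseEstimator, TransformerMixin):
        """Select a subset of DataFrame columns by name."""
        def __init__(self, names=None):
            self.names = names

        def fit(self, X, y=None, **fit_params):
            return self

        def transform(self, X):
            return X[self.names]

    df = pd.DataFrame({
        "height": [150.0, 160.0, 170.0, 180.0],
        "color":  ["red", "blue", "red", "blue"],
    })
    y = [0, 0, 1, 1]

    pipe = Pipeline([
        ("features", FeatureUnion([
            ("numeric", make_pipeline(Columns(names=["height"]), StandardScaler())),
            ("categorical", make_pipeline(Columns(names=["color"]), OneHotEncoder())),
        ])),
        ("model", KNeighborsClassifier(n_neighbors=1)),
    ])

    pipe.fit(df, y)
    preds = pipe.predict(df)  # with n_neighbors=1, training points predict their own labels
    ```
    
    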
    

    You could also check out Sklearn Pandas, which is worth a look for this kind of column-wise preprocessing.

  • 2020-12-06 17:47

    Assuming you have the following DF:

    In [163]: df
    Out[163]:
         a     b    c    d
    0  aaa  1.01  xxx  111
    1  bbb  2.02  yyy  222
    2  ccc  3.03  zzz  333
    
    In [164]: df.dtypes
    Out[164]:
    a     object
    b    float64
    c     object
    d      int64
    dtype: object
    

    you can find all numeric columns:

    In [165]: num_cols = df.columns[df.dtypes.apply(lambda c: np.issubdtype(c, np.number))]
    
    In [166]: num_cols
    Out[166]: Index(['b', 'd'], dtype='object')
    
    In [167]: df[num_cols]
    Out[167]:
          b    d
    0  1.01  111
    1  2.02  222
    2  3.03  333
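
    An equivalent, arguably simpler way to pick the numeric columns is pandas' built-in `select_dtypes` (a sketch on the same toy frame as above):

    ```python
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "a": ["aaa", "bbb", "ccc"],
        "b": [1.01, 2.02, 3.03],
        "c": ["xxx", "yyy", "zzz"],
        "d": [111, 222, 333],
    })

    # Same result as the dtype check above, via the built-in selector:
    num_cols = df.select_dtypes(include=np.number).columns
    ```
    
    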
    

    and apply StandardScaler only to those numeric columns:

    In [168]: scaler = StandardScaler()
    
    In [169]: df[num_cols] = scaler.fit_transform(df[num_cols])
    
    In [170]: df
    Out[170]:
         a         b    c         d
    0  aaa -1.224745  xxx -1.224745
    1  bbb  0.000000  yyy  0.000000
    2  ccc  1.224745  zzz  1.224745
    

    Now you can one-hot encode the categorical (non-numeric) columns...
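
    That last step could be done with pd.get_dummies on the remaining object columns (a sketch; the column names match the toy frame above):

    ```python
    import pandas as pd

    df = pd.DataFrame({
        "a": ["aaa", "bbb", "ccc"],
        "c": ["xxx", "yyy", "zzz"],
    })

    # One-hot encode the non-numeric columns, dropping the originals:
    encoded = pd.get_dummies(df, columns=["a", "c"])
    ```
    
    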
