Are the k-fold cross-validation scores from scikit-learn's `cross_val_score` and `GridSearchCV` biased if we include transformers in the pipeline?

别跟我提以往 2021-01-12 17:13

Data pre-processors such as StandardScaler should be used to fit_transform the train set and only transform (not fit) the test set. I expect the same fit/transform process to apply within cross-validation: fit the scaler on each training fold and only transform the corresponding validation fold. Is that what `cross_val_score` and `GridSearchCV` do when the transformer is included in the pipeline, or is the scaler fit on the entire dataset, producing biased scores?
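For reference, this is the pattern I mean; a minimal sketch with an illustrative dataset and split:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    scaler = StandardScaler()
    X_train_sc = scaler.fit_transform(X_train)  # fit on the train set only
    X_test_sc = scaler.transform(X_test)        # reuse the train-set statistics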

2 Answers
  •  死守一世寂寞
    2021-01-12 17:42

    No, sklearn doesn't call fit_transform on the entire dataset when the scaler is inside the pipeline.

    To check this, I subclassed StandardScaler to print the size of the dataset sent to it.

    from sklearn.preprocessing import StandardScaler

    class StScaler(StandardScaler):
        def fit_transform(self, X, y=None):
            print(len(X))  # report how many samples the scaler is fit on
            return super().fit_transform(X, y)
    

    If you now substitute StScaler for StandardScaler in your code, you'll see that the dataset passed to the scaler in the first case (scaling up front, outside cross-validation) is actually bigger.

    But why does the accuracy remain exactly the same? I think this is because LogisticRegression is not very sensitive to feature scale. If we instead use a classifier that is very sensitive to scale, such as KNeighborsClassifier, the accuracy between the two cases starts to vary.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_sc = StScaler().fit_transform(X)  # scaler sees all 569 samples: leakage
    knn = KNeighborsClassifier(n_neighbors=1)
    print(cross_val_score(knn, X_sc, y, cv=5))
    

    Outputs:

    569
    [0.94782609 0.96521739 0.97345133 0.92920354 0.9380531 ]
    

    And in the second case:

    from sklearn.pipeline import Pipeline

    pipe = Pipeline([
        ('sc', StScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=1))
    ])
    # the scaler is re-fit on each training fold (~455 of 569 samples)
    print(cross_val_score(pipe, X, y, cv=5))
    

    Outputs:

    454
    454
    456
    456
    456
    [0.95652174 0.97391304 0.97345133 0.92920354 0.9380531 ]
    

    Not a big change accuracy-wise, but a change nonetheless.
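    The same fold-wise fitting applies to GridSearchCV: pass it the Pipeline and the transformer is re-fit on each training fold for every hyperparameter candidate, so the scores aren't biased by leakage. A minimal sketch building on the pipe above (the parameter grid is illustrative):

    from sklearn.model_selection import GridSearchCV

    param_grid = {'knn__n_neighbors': [1, 3, 5]}  # 'knn' is the pipeline step name
    gs = GridSearchCV(pipe, param_grid, cv=5)
    gs.fit(X, y)  # scaler and classifier are fit per fold, per candidate
    print(gs.best_params_, gs.best_score_)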
