Use sklearn's GridSearchCV with a pipeline, preprocessing just once

二次信任 提交于 2020-01-09 12:54:08

问题


I'm using scickit-learn to tune a model hyper-parameters. I'm using a pipeline to have chain the preprocessing with the estimator. A simple version of my problem would look like this:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression


grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)

_ = grid.fit(X=np.random.rand(10, 3),
             y=np.random.randint(2, size=(10,)))

In my case the preprocessing (what would be StandardScale() in the toy example) is time consuming, and I'm not tuning any parameter of it.

So, when I execute the example, the StandardScaler is executed 12 times. 2 fit/predict * 2 cv * 3 parameters. But every time StandardScaler is executed for a different value of the parameter C, it returns the same output, so it'd be much more efficient, to compute it once, and then just run the estimator part of the pipeline.

I can manually split the pipeline between the preprocessing (no hyper parameters tuned) and the estimator. But to apply the preprocessing to the data, I should provide the training set only. So, I would have to implement the splits manually, and not use GridSearchCV at all.

Is there a simple/standard way to avoid repeating the preprocessing while using GridSearchCV?


回答1:


Essentially, GridSearchCV is also an estimator, implementing fit() and predict() methods, used by the pipeline.

So instead of:

grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)

Do this:

clf = make_pipeline(StandardScaler(), 
                    GridSearchCV(LogisticRegression(),
                                 param_grid={'logisticregression__C': [0.1, 10.]},
                                 cv=2,
                                 refit=True))

clf.fit()
clf.predict()

What it will do is, call the StandardScalar() only once, for one call to clf.fit() instead of multiple calls as you described.

Edit:

Changed refit to True, when GridSearchCV is used inside a pipeline. As mentioned in documentation:

refit : boolean, default=True Refit the best estimator with the entire dataset. If “False”, it is impossible to make predictions using this GridSearchCV instance after fitting.

If refit=False, clf.fit() will have no effect because the GridSearchCV object inside the pipeline will be reinitialized after fit(). When refit=True, the GridSearchCV will be refitted with the best scoring parameter combination on the whole data that is passed in fit().

So if you want to make the pipeline, just to see the scores of the grid search, only then the refit=False is appropriate. If you want to call the clf.predict() method, refit=True must be used, else Not Fitted error will be thrown.




回答2:


It is not possible to do this in the current version of scikit-learn (0.18.1). A fix has been proposed on the github project:

https://github.com/scikit-learn/scikit-learn/issues/8830

https://github.com/scikit-learn/scikit-learn/pull/8322




回答3:


For those who stumbled upon a little bit different problem, that I had as well.

Suppose you have this pipeline:

classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_features=100000, ngram_range=(1, 3))),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=SEED, n_jobs=-1))])

Then, when specifying parameters you need to include this 'clf_' name that you used for your estimator. So the parameters grid is going to be:

params={'clf__max_features':[0.3, 0.5, 0.7],
        'clf__min_samples_leaf':[1, 2, 3],
        'clf__max_depth':[None]
        }


来源:https://stackoverflow.com/questions/43366561/use-sklearns-gridsearchcv-with-a-pipeline-preprocessing-just-once

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!