How to use scikit's preprocessing/normalization along with cross validation?

Question

As an example of cross-validation without any preprocessing, I can do something like this:

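    from sklearn.linear_model import SGDClassifier
    # GridSearchCV lives in sklearn.model_selection (sklearn.grid_search was removed in 0.20)
    from sklearn.model_selection import GridSearchCV
    # Candidate values for the regularization penalty
    tuned_params = [{"penalty": ["l2", "l1"]}]
    SGD = SGDClassifier()
    # Cross-validated grid search over the candidate parameters
    clf = GridSearchCV(SGD, tuned_params, verbose=5)
    clf.fit(x_train, y_train)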

I would like to preprocess my data using something like

    from sklearn import preprocessing
    x_scaled = preprocessing.scale(x_train)

But it would not be a good idea to do this before setting up the cross-validation, because then the training and test sets would be normalized together. How do I set up the cross-validation so that the corresponding training and test sets are preprocessed separately on each run?

Answer 1:

Per the documentation, if you employ Pipeline, this can be done for you. From the docs, just above section 3.1.1.1, emphasis mine:

Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction [...] A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation[.]

More relevant information on pipelines is available in the scikit-learn documentation.
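As a minimal sketch of what this could look like for the code in the question: the scaler and the classifier are wrapped in a Pipeline, and the pipeline is passed to GridSearchCV in place of the bare classifier. The synthetic data from make_classification, the step names "scale" and "sgd", the choice of StandardScaler, and cv=5 are all assumptions made for illustration; they are not from the original post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for x_train / y_train (assumption, for illustration only).
x_train, y_train = make_classification(
    n_samples=200, n_features=10, random_state=0
)

# The step names "scale" and "sgd" are arbitrary labels chosen for this sketch.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("sgd", SGDClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter name>.
tuned_params = [{"sgd__penalty": ["l2", "l1"]}]

# On each CV split the whole pipeline is refit, so the scaler's mean and
# variance are learned from that split's training fold only and then
# applied to the held-out fold.
clf = GridSearchCV(pipe, tuned_params, cv=5)
clf.fit(x_train, y_train)

print(clf.best_params_)
```

After fitting, clf.predict(x_new) applies the scaler fitted on the full training set before classifying, so no separate call to preprocessing.scale is needed at prediction time.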

Source: https://stackoverflow.com/questions/32612944/how-to-use-scikits-preprocessing-normalization-along-with-cross-validation

Tags

python

scikit-learn
