Question
As an example of cross-validation without any preprocessing, I can do something like this:
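from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20

# Grid of hyperparameters to search over
tuned_params = [{"penalty": ["l2", "l1"]}]
sgd = SGDClassifier()
# Pass the estimator and grid defined above (the original referred to undefined names)
clf = GridSearchCV(sgd, tuned_params, verbose=5)
clf.fit(x_train, y_train)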
I would like to preprocess my data using something like
from sklearn import preprocessing
x_scaled = preprocessing.scale(x_train)
But it would not be a good idea to do this before setting up the cross-validation, because then the training and test sets would be normalized together. How do I set up the cross-validation to preprocess the corresponding training and test sets separately on each run?
Answer 1:
Per the documentation, if you employ a Pipeline, this can be done for you. From the docs, just above section 3.1.1.1, emphasis mine:
Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction [...] A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation[.]
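As a minimal sketch of what that looks like for the setup in the question: the scaler and the classifier are chained in a Pipeline, and the pipeline is handed to GridSearchCV. Several things here are assumptions rather than part of the original post: the data is a synthetic stand-in for x_train and y_train, StandardScaler is used as the estimator counterpart of preprocessing.scale, and the step names "scale" and "sgd" are arbitrary labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the asker's x_train / y_train (assumption)
x_train, y_train = make_classification(n_samples=200, n_features=10, random_state=0)

# Chain scaling and classification; the step names are arbitrary labels
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("sgd", SGDClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
tuned_params = [{"sgd__penalty": ["l2", "l1"]}]

# In every CV split the scaler is fit on the training part only,
# then applied to the held-out part before scoring
clf = GridSearchCV(pipe, tuned_params, cv=5)
clf.fit(x_train, y_train)

print(clf.best_params_)
```

Because the grid search refits the whole pipeline on all of x_train by default, clf.predict on new data applies the scaling learnt from the training set automatically.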
More relevant information on pipelines is available in the scikit-learn user guide.
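To make the per-fold behaviour concrete, here is a rough manual equivalent of what happens to the preprocessing inside cross-validation: the scaler is fit on the training indices of each split and only applied to the held-out indices. This is an illustrative sketch, not the library's internal code; the synthetic data, the choice of KFold with five splits, and the variable names are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real training set (assumption)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

scores = []
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kfold.split(X):
    # Learn the mean and variance from the training fold only
    scaler = StandardScaler().fit(X[train_idx])
    X_tr = scaler.transform(X[train_idx])
    # Apply the same, already-fitted transform to the held-out fold
    X_te = scaler.transform(X[test_idx])

    model = SGDClassifier(random_state=0).fit(X_tr, y[train_idx])
    scores.append(model.score(X_te, y[test_idx]))

print(np.mean(scores))
```

The held-out fold never influences the scaling statistics, which is the property that is lost when preprocessing.scale is applied to the full data before splitting.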
Source: https://stackoverflow.com/questions/32612944/how-to-use-scikits-preprocessing-normalization-along-with-cross-validation