Question
I'm trying to get the best set of parameters for an SVR model.
I'd like to use GridSearchCV over different values of C.
However, from previous tests I noticed that the split into training/test sets highly influences the overall performance (r2 in this instance).
To address this problem, I'd like to implement a repeated 5-fold cross-validation (10 x 5CV). Is there a built-in way of performing it using GridSearchCV?
QUICK SOLUTION:
Following the idea presented in the scikit-learn official documentation, a quick solution is:
import numpy
from sklearn.model_selection import GridSearchCV, KFold

NUM_TRIALS = 10
scores = []
for i in range(NUM_TRIALS):
    cv = KFold(n_splits=5, shuffle=True, random_state=i)
    clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)
    clf.fit(X, y)  # fit is required before best_score_ is available; X, y are your data
    scores.append(clf.best_score_)

print("Average Score: {0} STD: {1}".format(numpy.mean(scores), numpy.std(scores)))
Answer 1:
This is called nested cross-validation. You can look at the official documentation example to guide you in the right direction, and also have a look at my other answer here for a similar approach.
You can adapt the steps to suit your needs:
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

svr = SVC(kernel="rbf")
c_grid = {"C": [1, 10, 100, ... ]}

# CV technique: "GroupKFold", "LeaveOneOut", "LeaveOneGroupOut", etc.
# i below is the repetition index (e.g. the trial counter from a loop like the quick solution above)

# To be used within GridSearch (5 in your case)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=i)

# To be used in outer CV (you asked for 10)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=i)

# Non-nested parameter search and scoring
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
clf.fit(X_iris, y_iris)
non_nested_score = clf.best_score_

# Pass the GridSearch estimator to cross_val_score
# This will be your required 10 x 5 CVs:
# 10 for the outer CV and 5 for GridSearch's internal CV
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv).mean()
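To make the repetition explicit, the inner/outer construction above can be wrapped in a trial loop, as in the scikit-learn nested cross-validation example. A sketch, reusing svr, c_grid (with the ellipsis replaced by actual C values), X_iris and y_iris from above:

import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

NUM_TRIALS = 10
nested_scores = []
for i in range(NUM_TRIALS):
    inner_cv = KFold(n_splits=5, shuffle=True, random_state=i)
    outer_cv = KFold(n_splits=10, shuffle=True, random_state=i)
    clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
    # Each trial yields one mean nested score over the outer folds
    nested_scores.append(cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv).mean())

print("Mean nested score: {0:.3f} +/- {1:.3f}".format(np.mean(nested_scores), np.std(nested_scores)))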
Edit - Description of nested cross validation with cross_val_score() and GridSearchCV()
1. clf = GridSearchCV(estimator, param_grid, cv=inner_cv).
2. Pass clf, X, y, outer_cv to cross_val_score.
3. As seen in the source code of cross_val_score, X will be divided into X_outer_train and X_outer_test using outer_cv. Same for y.
4. X_outer_test will be held back and X_outer_train will be passed on to clf for fit() (GridSearchCV in our case). Assume X_outer_train is called X_inner from here on, since it is passed to the inner estimator; likewise y_outer_train is y_inner.
5. X_inner will now be split into X_inner_train and X_inner_test using inner_cv in the GridSearchCV. Same for y.
6. Now the gridSearch estimator will be trained using X_inner_train and y_inner_train and scored using X_inner_test and y_inner_test.
7. Steps 5 and 6 will be repeated for inner_cv iterations (5 in this case).
8. The hyper-parameters for which the average score over all inner iterations (X_inner_train, X_inner_test) is best are passed on to clf.best_estimator_, which is fitted on all the data, i.e. X_outer_train.
9. This clf (gridsearch.best_estimator_) will then be scored using X_outer_test and y_outer_test.
10. Steps 3 to 9 will be repeated for outer_cv iterations (10 here) and an array of scores will be returned from cross_val_score.
11. We then use mean() to get back nested_score.
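If you also want to see which hyper-parameters win in each outer fold (step 8 above), one option is cross_validate with return_estimator=True instead of cross_val_score. A sketch, assuming clf, outer_cv, X_iris and y_iris as defined earlier:

from sklearn.model_selection import cross_validate

# Keeps the fitted GridSearchCV object from each outer fold
res = cross_validate(clf, X_iris, y_iris, cv=outer_cv, return_estimator=True)
print("Nested score: {0:.3f}".format(res["test_score"].mean()))
print([gs.best_params_ for gs in res["estimator"]])  # per-fold winning hyper-parameters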
Answer 2:
You can supply different cross-validation generators to GridSearchCV. The default for binary or multiclass classification problems is StratifiedKFold. Otherwise, it uses KFold. But you can supply your own. In your case, it looks like you want RepeatedKFold or RepeatedStratifiedKFold.
from sklearn.model_selection import GridSearchCV, RepeatedKFold

# Define svr and p_grid here
...

# Specify the cross-validation generator, in this case 10 x 5CV
cv = RepeatedKFold(n_splits=5, n_repeats=10)
clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)

# Continue as usual
clf.fit(...)
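With a RepeatedKFold generator, GridSearchCV scores every candidate on all 5 x 10 = 50 splits, and mean_test_score in cv_results_ is the average over them. A quick check, assuming X and y hold your data:

clf.fit(X, y)
print(clf.best_params_, clf.best_score_)  # best_score_ is the mean over all 50 splits

# One split<k>_test_score column per split in cv_results_
n_splits = sum(1 for k in clf.cv_results_ if k.startswith("split") and k.endswith("_test_score"))
print(n_splits)  # 50 for the 10 x 5CV above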
Source: https://stackoverflow.com/questions/42228735/scikit-learn-gridsearchcv-with-multiple-repetitions