Question
I'm trying to get the best set of parameters for an SVR model. I'd like to use GridSearchCV over different values of C. However, from previous tests I noticed that the split into training/test set highly influences the overall performance (r2 in this instance).

To address this problem, I'd like to implement a repeated 5-fold cross-validation (10 x 5CV). Is there a built-in way of performing it using GridSearchCV?
QUICK SOLUTION:
Following the idea presented in the scikit-learn official documentation, a quick solution is represented by:
from sklearn.model_selection import GridSearchCV, KFold
import numpy

# svr, p_grid, X and y are assumed to be defined beforehand
NUM_TRIALS = 10
scores = []
for i in range(NUM_TRIALS):
    cv = KFold(n_splits=5, shuffle=True, random_state=i)
    clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)
    clf.fit(X, y)  # best_score_ is only available after fitting
    scores.append(clf.best_score_)

print("Average Score: {0} STD: {1}".format(numpy.mean(scores), numpy.std(scores)))
Answer 1:
This is called nested cross-validation. You can look at the official documentation example to guide you in the right direction, and also have a look at my other answer here for a similar approach.

You can adapt the steps to suit your needs:
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

svr = SVC(kernel="rbf")
c_grid = {"C": [1, 10, 100]}  # extend with more values as needed

# CV technique: "GroupKFold", "LeaveOneOut", "LeaveOneGroupOut", etc.

# To be used within GridSearch (5 in your case); a fixed seed is used here
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)

# To be used in outer CV (you asked for 10)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Non-nested parameter search and scoring
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
clf.fit(X_iris, y_iris)
non_nested_score = clf.best_score_

# Pass the grid search estimator to cross_val_score
# This will be your required 10 x 5 CVs:
# 10 for the outer CV and 5 for GridSearchCV's internal CV
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv).mean()
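As a quick follow-up, the two estimates can then be compared; the nested score is usually somewhat lower, since the parameters are never tuned on the data used to score them:

print("Non-nested score: {0:.3f}  Nested score: {1:.3f}".format(non_nested_score, nested_score))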
Edit - Description of nested cross-validation with cross_val_score() and GridSearchCV():

1. clf = GridSearchCV(estimator, param_grid, cv=inner_cv).
2. Pass clf, X, y, outer_cv to cross_val_score.
3. As seen in the source code of cross_val_score, this X will be divided into X_outer_train, X_outer_test using outer_cv. Same for y.
4. X_outer_test will be held back, and X_outer_train will be passed on to clf for fit() (GridSearchCV in our case). Assume X_outer_train is called X_inner from here on, since it is passed to the inner estimator, and assume y_outer_train is y_inner.
5. X_inner will now be split into X_inner_train and X_inner_test using inner_cv in the GridSearchCV. Same for y.
6. Now the grid search estimator will be trained using X_inner_train and y_inner_train, and scored using X_inner_test and y_inner_test.
7. Steps 5 and 6 will be repeated for inner_cv_iters (5 in this case).
8. The hyper-parameters for which the average score over all inner iterations (X_inner_train, X_inner_test) is best become clf.best_estimator_, which is then fitted on all the data it was given, i.e. X_outer_train.
9. This clf (gridsearch.best_estimator_) will then be scored using X_outer_test and y_outer_test.
10. Steps 3 to 9 will be repeated for outer_cv_iters (10 here), and an array of scores will be returned from cross_val_score.
11. We then use mean() to get back nested_score (a manual sketch of these steps is shown after this list).
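Here is a minimal manual sketch of the same double loop, written without cross_val_score so the steps above are visible; variable names mirror the list, and the estimator/data setup is the illustrative iris/SVC one used earlier rather than anything from the original answer:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
c_grid = {"C": [1, 10, 100]}

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)

outer_scores = []
for train_idx, test_idx in outer_cv.split(X):   # steps 3-4: outer split
    X_outer_train, X_outer_test = X[train_idx], X[test_idx]
    y_outer_train, y_outer_test = y[train_idx], y[test_idx]

    # steps 5-8: grid search with the inner CV on the outer training fold only
    clf = GridSearchCV(SVC(kernel="rbf"), param_grid=c_grid, cv=inner_cv)
    clf.fit(X_outer_train, y_outer_train)

    # step 9: score the refitted best estimator on the held-back outer test fold
    outer_scores.append(clf.score(X_outer_test, y_outer_test))

nested_score = np.mean(outer_scores)            # steps 10-11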
Answer 2:
You can supply different cross-validation generators to GridSearchCV. The default for binary or multiclass classification problems is StratifiedKFold. Otherwise, it uses KFold. But you can supply your own. In your case, it looks like you want RepeatedKFold or RepeatedStratifiedKFold.
from sklearn.model_selection import GridSearchCV, RepeatedKFold

# Define svr and p_grid here
...

# Specify cross-validation generator, in this case (10 x 5CV)
cv = RepeatedKFold(n_splits=5, n_repeats=10)
clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)

# Continue as usual
clf.fit(...)
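Once fitted, the usual GridSearchCV attributes apply; a brief hypothetical continuation (assuming you called clf.fit(X, y) on your data):

print(clf.best_params_)  # best C found by the repeated search
print(clf.best_score_)   # mean r2 of that C across the 10 x 5 = 50 validation folds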
Source: https://stackoverflow.com/questions/42228735/scikit-learn-gridsearchcv-with-multiple-repetitions