Question
I am working with KFold from scikit-learn version 0.22. It has a parameter shuffle.
According to the documentation
shuffle : boolean, optional — Whether to shuffle the data before splitting into batches.
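To make the parameter concrete, here is a minimal sketch (not part of my original code) showing how shuffle changes which indices end up in each test fold on a toy array of 10 samples:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)

# Without shuffling, each test fold is a contiguous block in the original order.
kf = KFold(n_splits=5)
print([test.tolist() for _, test in kf.split(X)])
# → [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

# With shuffle=True, the indices are permuted once before splitting,
# so each fold is a random (but reproducible, via random_state) subset.
kf_shuffled = KFold(n_splits=5, shuffle=True, random_state=0)
print([test.tolist() for _, test in kf_shuffled.split(X)])
```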
I ran a simple comparison of KFold with shuffle set to False (the default) and to True:
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, KFold, RepeatedKFold, RepeatedStratifiedKFold
from sklearn import metrics
X, y = load_digits(return_X_y=True)
def run_nfold(X, y, classifier, scorer, cv, n_repeats):
    results = []
    for n in range(n_repeats):
        for train_index, test_index in cv.split(X, y):
            x_train, y_train = X[train_index], y[train_index]
            x_test, y_test = X[test_index], y[test_index]
            classifier.fit(x_train, y_train)
            results.append(scorer(y_test, classifier.predict(x_test)))
    return results
classifier = SGDClassifier(loss='hinge', penalty='elasticnet', fit_intercept=True)
scorer = metrics.accuracy_score
n_splits = 5
kf = KFold(n_splits=n_splits)
results_kf = run_nfold(X,y, classifier, scorer, kf, 10)
print('KFold mean = ', np.mean(results_kf))
kf_shuffle = KFold(n_splits=n_splits, shuffle=True, random_state = 11)
results_kf_shuffle = run_nfold(X,y, classifier, scorer, kf_shuffle, 10)
print('KFold Shuffled mean = ', np.mean(results_kf_shuffle))
produces
KFold mean = 0.9119255648406066
KFold Shuffled mean = 0.9505304859176724
Using the Kolmogorov-Smirnov test:
from scipy.stats import ks_2samp
print('Compare KFold with KFold shuffled results')
ks_2samp(results_kf, results_kf_shuffle)
shows that the default, non-shuffled KFold produces statistically significantly lower scores than the shuffled KFold:
Compare KFold with KFold shuffled results
Ks_2sampResult(statistic=0.66, pvalue=1.3182765881237494e-10)
I don't understand the difference between the shuffled and non-shuffled results. Why does shuffling change the distribution of the scores so drastically?
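One diagnostic I could run (a sketch added here, not part of the original question) is to inspect the class distribution of each non-shuffled test fold, since any ordering in the digits dataset would make contiguous folds differ from random ones:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold

X, y = load_digits(return_X_y=True)
for _, test_index in KFold(n_splits=5).split(X, y):
    # np.bincount shows how many samples of each digit (0-9)
    # land in this contiguous test fold
    print(np.bincount(y[test_index], minlength=10))
```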
Source: https://stackoverflow.com/questions/60496628/difference-between-sklearn-kfold-with-and-without-using-shuffle