Question
I'm having trouble implementing a cross-validation setup that I saw in a paper. It is explained in this attached picture:
It says that they use 5 folds, which means k = 5. But then the authors say that they repeat the cross-validation 20 times, which creates 100 folds in total. Does that mean I can just use this piece of code:
kfold = StratifiedKFold(n_splits=100, shuffle=True, random_state=seed)
After all, my code also yields 100 folds. Any recommendations?
Answer 1:
I'm pretty sure they are talking about RepeatedStratifiedKFold. There are 2 simple ways to create 5 folds 20 times.
Method 1:
For your case, use n_splits=5 and n_repeats=20. The code below is just the sample from the scikit-learn website.
>>> import numpy as np
>>> from sklearn.model_selection import RepeatedStratifiedKFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 1, 1])
>>> rskf = RepeatedStratifiedKFold(n_splits=2, n_repeats=2,
...                                random_state=42)
>>> for train_index, test_index in rskf.split(X, y):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
...
TRAIN: [1 2] TEST: [0 3]  # repeat 1: the folds are [1 2] and [0 3]
TRAIN: [0 3] TEST: [1 2]
TRAIN: [1 3] TEST: [0 2]  # repeat 2: the folds are [1 3] and [0 2]
TRAIN: [0 2] TEST: [1 3]
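For the setting in the question (5 folds repeated 20 times), the same class could be used roughly as follows. This is only a sketch: the toy data `X` and `y` and the seed are assumptions made for illustration, not something taken from the paper.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy data (an assumption for illustration): 100 samples, 2 balanced classes.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = np.array([0, 1] * 50)

# 5 folds repeated 20 times, as described in the question.
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=42)

test_sizes = []
for train_index, test_index in rskf.split(X, y):
    # Train and evaluate a model here; this sketch only records the split sizes.
    test_sizes.append(len(test_index))

n_total_splits = len(test_sizes)
print(n_total_splits)   # number of train/test splits produced (5 * 20)
print(set(test_sizes))  # distinct test-fold sizes (1/5 of the samples each)
```

Each of the splits still holds out one fifth of the data, which is the difference from a single 100-fold split.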
Method 2:
You can achieve the same effect with a loop. Note that random_state cannot be a fixed number, otherwise you will get the same 5 folds all 20 times.
for i in range(20):
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
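A fuller version of this loop might look like the sketch below, which collects every train/test split so they can be counted. The toy data and the `all_splits` list are assumptions introduced only for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (an assumption for illustration): 100 samples, 2 balanced classes.
X = np.zeros((100, 1))
y = np.array([0, 1] * 50)

all_splits = []
for i in range(20):
    # A different seed per repeat, so each repeat shuffles the data differently.
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
    for train_index, test_index in kfold.split(X, y):
        all_splits.append((train_index, test_index))

print(len(all_splits))  # 20 repeats * 5 folds
```

Using the loop variable as the seed keeps the run reproducible while still giving each repeat its own partition.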
Why is it different from your code?
Say you have 10000 data points and you create 100 folds. Each fold then has 100 points, so your training set has 9900 points versus a validation set of 100.
RepeatedStratifiedKFold creates 5 folds of 2000 points each, so every split trains on 8000 points and validates on 2000. It then creates a new set of 5 folds, again and again, 20 times in total. That means you still get 100 folds, but with a much larger validation set. Depending on your objective, you might want a larger validation set, e.g. to have enough data to validate properly, and RepeatedStratifiedKFold lets you create the same number of folds in a different way (with a different training-validation proportion). Other than that, I'm not sure whether there are any other objectives.
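The arithmetic above can be checked with a few lines of plain Python. The helper `fold_sizes` is hypothetical (it is not part of scikit-learn) and assumes the number of samples divides evenly by the number of folds.

```python
def fold_sizes(n_samples, n_splits):
    """Return (train_size, validation_size) for one split of k-fold CV.

    Hypothetical helper for illustration; assumes n_samples is divisible
    by n_splits.
    """
    validation_size = n_samples // n_splits
    return n_samples - validation_size, validation_size

n_samples = 10000

# One pass of 100-fold CV: 100 splits, each validating on 1% of the data.
print(fold_sizes(n_samples, 100))  # (9900, 100)

# 5-fold CV repeated 20 times: also 100 splits, each validating on 20%.
print(fold_sizes(n_samples, 5))    # (8000, 2000)
```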
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RepeatedStratifiedKFold.html
Answer 2:
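What about the following? Note that 100 repeats of 5 folds would give 500 folds, so 20 repeats are enough to get 100:
for i in range(20):
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)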
Source: https://stackoverflow.com/questions/43613726/how-to-implement-n-times-repeated-k-folds-cross-validation-that-yields-nk-folds