How to implement n times repeated k-folds cross validation that yields n*k folds in sklearn?

只愿长相守 submitted on 2021-02-08 06:23:11

Question


I'm having trouble implementing a cross-validation setting that I saw in a paper. Basically, it is explained in this attached picture:

So, it says that they use 5 folds, which means k = 5. But then the authors say that they repeat the cross-validation 20 times, which creates 100 folds in total. Does that mean I can just use this piece of code:

kfold = StratifiedKFold(n_splits=100, shuffle=True, random_state=seed)

After all, my code also yields 100 folds. Any recommendations?


Answer 1:


I'm pretty sure they are talking about RepeatedStratifiedKFold. There are 2 simple ways to create 5 folds 20 times.

Method 1:

For your case, use n_splits=5 and n_repeats=20. The code below is just the sample from the scikit-learn website.

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])

# 2 folds repeated 2 times -> 4 train/test splits
rskf = RepeatedStratifiedKFold(n_splits=2, n_repeats=2, random_state=42)
for train_index, test_index in rskf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Sample output (the exact indices depend on random_state):

TRAIN: [1 2] TEST: [0 3]   # first repeat: the folds are [1 2] and [0 3]
TRAIN: [0 3] TEST: [1 2]
TRAIN: [1 3] TEST: [0 2]   # second repeat: the folds are [1 3] and [0 2]
TRAIN: [0 2] TEST: [1 3]
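
Adapted to the setting in the question (5 folds, 20 repeats), Method 1 could look like the sketch below. The synthetic dataset and the LogisticRegression model are assumptions made only so the example runs; substitute your own data and estimator.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic data, used only for illustration (not from the paper).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 5 folds repeated 20 times -> 100 train/test splits in total.
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
n_total_splits = rskf.get_n_splits(X, y)

# cross_val_score returns one score per split.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=rskf)
print(n_total_splits, len(scores), scores.mean())
```

Passing the splitter as cv means every repeat is scored, so you can average over all 100 scores or inspect their spread.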

Method 2:

You can achieve the same effect with a loop. Note that random_state must change on each iteration; with a fixed value you would get the same 5 folds 20 times.

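from sklearn.model_selection import StratifiedKFold

for i in range(20):
    # a different seed per repeat gives a different 5-fold partition each time
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)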
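
A fuller sketch of the loop approach, which also collects the splits, might look like this. The toy arrays and the all_splits list are hypothetical names introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 50 samples, two balanced classes (illustrative only).
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

all_splits = []
for repeat in range(20):
    # Use the repeat index as the seed so each repeat shuffles differently.
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=repeat)
    for train_idx, test_idx in skf.split(X, y):
        all_splits.append((repeat, train_idx, test_idx))

# 20 repeats * 5 folds = 100 splits, each with a 10-sample test set.
test_sizes = {len(test_idx) for _, _, test_idx in all_splits}
print(len(all_splits), test_sizes)
```

Within each repeat the 5 test sets partition the data, so every sample is validated exactly once per repeat.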

Why is it different from your code?

Say you have 10,000 data points and you create 100 folds. Each fold then holds 100 points, so every split trains on 9,900 points and validates on 100.

RepeatedStratifiedKFold instead creates 5 folds of 2,000 points each, so every split trains on 8,000 points and validates on 2,000. It then reshuffles and creates 5 new folds, repeating this 20 times. You still end up with 100 train/validation splits, but each one has a much larger validation set. Depending on your objective, you might want a larger validation set, e.g. to have enough data to validate properly, and RepeatedStratifiedKFold lets you create the same number of splits with a different training/validation proportion. Other than that, I'm not sure whether there are any other objectives.
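
The size difference can be checked directly. The sketch below assumes 10,000 samples with two balanced classes (the class balance is an assumption; the answer only specifies the sample count) and compares the validation-set sizes of the two schemes.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold

# 10,000 placeholder samples with two balanced classes.
X = np.zeros((10000, 1))
y = np.array([0, 1] * 5000)

single = StratifiedKFold(n_splits=100, shuffle=True, random_state=0)
repeated = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)

# Size of the validation set in every split of each scheme.
single_sizes = [len(test) for _, test in single.split(X, y)]
repeated_sizes = [len(test) for _, test in repeated.split(X, y)]

print(len(single_sizes), set(single_sizes))      # 100 splits of 100 samples
print(len(repeated_sizes), set(repeated_sizes))  # 100 splits of 2000 samples
```

Both schemes produce 100 splits, but only the repeated one validates on 20% of the data each time.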

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RepeatedStratifiedKFold.html

Thank you RepeatedStratifiedKFold.




Answer 2:


What about:

for i in range(100):
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)


Source: https://stackoverflow.com/questions/43613726/how-to-implement-n-times-repeated-k-folds-cross-validation-that-yields-nk-folds
