Why does calling the KFold generator with shuffle give the same indices?

Submitted by 折月煮酒 on 2020-01-01 16:51:12

Question


With sklearn, each new KFold object created with shuffle=True produces a different, freshly randomized set of fold indices. However, every generator obtained from a given KFold object yields the same indices for each fold, even when shuffle is true. Why does it work this way?

Example:

import numpy as np
from sklearn.cross_validation import KFold  # old pre-0.18 API, as in the original question

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(4, n_folds=2, shuffle=True)

for fold in kf:
    print(fold)

print('---second round----')

for fold in kf:
    print(fold)

Output:

(array([2, 3]), array([0, 1]))
(array([0, 1]), array([2, 3]))
---second round----  # same indices for the folds
(array([2, 3]), array([0, 1]))
(array([0, 1]), array([2, 3]))

This question was motivated by a comment on this answer. I decided to split it into a new question to prevent that answer from becoming too long.


Answer 1


A new iteration over the same KFold object will not reshuffle the indices; the shuffle happens only once, when the object is instantiated. KFold() never sees the data itself, only the number of samples, and it uses that count to shuffle an array of indices. From the KFold constructor:

if shuffle:
    rng = check_random_state(self.random_state)
    rng.shuffle(self.idxs)

Each time a generator is created to iterate through the folds, it reuses the same pre-shuffled indices and divides them the same way, which is why every pass over the object produces identical folds.
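This shuffle-at-instantiation behaviour can be sketched with plain NumPy. The class below is illustrative only (the names and the exact fold-slicing are simplified, not sklearn's real implementation), but it reproduces the effect in the question: two passes over the same object give identical folds.

```python
import numpy as np

class KFoldSketch:
    """Minimal sketch of the old KFold behaviour: shuffle once, at init."""

    def __init__(self, n, n_folds, random_state=None):
        self.n, self.n_folds = n, n_folds
        self.idxs = np.arange(n)
        rng = np.random.RandomState(random_state)
        rng.shuffle(self.idxs)  # shuffling happens exactly once, here

    def __iter__(self):
        # Every iteration re-reads the same pre-shuffled self.idxs,
        # so the folds come out identical on every pass.
        fold_size = self.n // self.n_folds
        for k in range(self.n_folds):
            test = self.idxs[k * fold_size:(k + 1) * fold_size]
            train = np.setdiff1d(self.idxs, test)
            yield train, test

kf = KFoldSketch(4, 2, random_state=0)
first = [(tr.tolist(), te.tolist()) for tr, te in kf]
second = [(tr.tolist(), te.tolist()) for tr, te in kf]
assert first == second  # same folds on every pass over the same object
```

Only creating a new object (a fresh instantiation with a different seed or no seed) runs the shuffle again and can yield different folds.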

Take a look at the code for KFold's base class, _PartitionIterator(with_metaclass(ABCMeta)), where __iter__ is defined. The base class's __iter__ calls KFold's _iter_test_indices to divide the already-shuffled indices and yield the train and test indices for each fold.
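The division of labour between the base class and KFold can be sketched as follows. This is a simplified reconstruction of the pattern, not sklearn's actual source: the subclass yields only test-index blocks, and the base class's __iter__ turns each block into a (train, test) pair with a boolean mask.

```python
import numpy as np
from abc import ABCMeta, abstractmethod

class _PartitionIteratorSketch(metaclass=ABCMeta):
    """Sketch of the base-class role: build (train, test) pairs."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # For each block of test indices the subclass yields,
        # mask them out of the full index range to get the train set.
        indices = np.arange(self.n)
        for test_index in self._iter_test_indices():
            test_mask = np.zeros(self.n, dtype=bool)
            test_mask[test_index] = True
            yield indices[~test_mask], indices[test_mask]

    @abstractmethod
    def _iter_test_indices(self):
        """Subclasses yield one array of test indices per fold."""

class KFoldLike(_PartitionIteratorSketch):
    """Sketch of the KFold role: slice the pre-shuffled indices."""

    def __init__(self, n, n_folds, random_state=None):
        super().__init__(n)
        self.n_folds = n_folds
        self.idxs = np.arange(n)
        np.random.RandomState(random_state).shuffle(self.idxs)  # once, at init

    def _iter_test_indices(self):
        fold_size = self.n // self.n_folds
        for k in range(self.n_folds):
            yield self.idxs[k * fold_size:(k + 1) * fold_size]
```

Because _iter_test_indices only slices self.idxs, and self.idxs is fixed after __init__, every generator produced by __iter__ walks the same partition.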



Source: https://stackoverflow.com/questions/34940465/why-does-calling-the-kfold-generator-with-shuffle-give-the-same-indices
