Custom cross validation split sklearn

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-03 09:01:11

This type of thing can usually be done with sklearn.cross_validation.LeaveOneLabelOut (in scikit-learn 0.18 and later the equivalent class is sklearn.model_selection.LeaveOneGroupOut). You just need to construct a label vector that encodes your groups, i.e., all samples in K1 share one label, all samples in K2 share another, and so on (the example below uses 0, 1, 2).

Here is a fully runnable example with fake data. The important lines are the one that creates the cv object and the call to cross_val_score.

import numpy as np

n_features = 10

# Make some data
A = np.random.randn(3, n_features)
B = np.random.randn(5, n_features)
C = np.random.randn(4, n_features)
D = np.random.randn(7, n_features)
E = np.random.randn(9, n_features)

# Group it
K1 = np.concatenate([A, B])
K2 = np.concatenate([C, D])
K3 = E

data = np.concatenate([K1, K2, K3])

# Make some dummy prediction target
target = np.random.randn(len(data)) > 0

# Make the corresponding labels
labels = np.concatenate([[i] * len(K) for i, K in enumerate([K1, K2, K3])])

from sklearn.cross_validation import LeaveOneLabelOut, cross_val_score

cv = LeaveOneLabelOut(labels)

# Use some classifier in cross-validation on the data
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
scores = cross_val_score(lr, data, target, cv=cv)
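
On scikit-learn 0.20 and later the sklearn.cross_validation module no longer exists. A minimal sketch of the same grouped split with the sklearn.model_selection API follows, assuming a recent release. The fixed seed and the alternating target are illustrative choices, not part of the original answer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.RandomState(0)  # fixed seed, chosen only for repeatability
n_features = 10

# Same group sizes as above: K1 = A + B, K2 = C + D, K3 = E
group_sizes = [3 + 5, 4 + 7, 9]
data = rng.randn(sum(group_sizes), n_features)

# One group id per sample: 0 for K1, 1 for K2, 2 for K3
labels = np.repeat(np.arange(len(group_sizes)), group_sizes)

# Dummy alternating target (an assumption) so every training fold has both classes
target = np.arange(len(data)) % 2

# The splitter takes no constructor arguments; groups are passed at call time
cv = LeaveOneGroupOut()
lr = LogisticRegression()
scores = cross_val_score(lr, data, target, groups=labels, cv=cv)
print(scores)  # one accuracy value per held-out group
```

The main difference from the older API is that the group vector goes to cross_val_score through the groups argument instead of the splitter's constructor.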

However, you may run into a situation where you want to define your folds completely by hand. In that case, create an iterable (e.g. a list) of (train, test) pairs, where each element of a pair is an array of indices selecting the samples for the train or test set of that fold. Let's check this:

# create train and test folds from our labels:
cv_by_hand = [(np.where(labels != label)[0], np.where(labels == label)[0])
               for label in np.unique(labels)]

# We check this against our existing cv by converting the latter to a list
cv_to_list = list(cv)

# print() with a single argument works on both Python 2 and Python 3
print(cv_by_hand)
print(cv_to_list)

# Check equality
for (train1, test1), (train2, test2) in zip(cv_by_hand, cv_to_list):
    assert (train1 == train2).all() and (test1 == test2).all()

# Use the created cv_by_hand in cross validation
scores2 = cross_val_score(lr, data, target, cv=cv_by_hand)


# assert equality again
assert (scores == scores2).all()
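
With the sklearn.model_selection splitters, the cv object is not iterable by itself, so list(cv) has to become list(cv.split(...)). The same by-hand check can be sketched as follows, assuming a recent scikit-learn and Python 3. The data, seed, and alternating target are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.RandomState(0)             # fixed seed for repeatability
labels = np.repeat([0, 1, 2], [8, 11, 9])  # group ids for K1, K2, K3
data = rng.randn(len(labels), 10)
target = np.arange(len(data)) % 2          # dummy target with both classes in each fold

# Hand-made folds: train on every other group, test on the held-out one
cv_by_hand = [(np.where(labels != label)[0], np.where(labels == label)[0])
              for label in np.unique(labels)]

# Splits produced by the library, materialised through split()
logo = LeaveOneGroupOut()
cv_to_list = list(logo.split(data, target, groups=labels))

# Both descriptions of the folds should agree index for index
for (train1, test1), (train2, test2) in zip(cv_by_hand, cv_to_list):
    assert np.array_equal(train1, train2) and np.array_equal(test1, test2)

# A plain list of (train, test) pairs is accepted directly as cv
lr = LogisticRegression()
scores = cross_val_score(lr, data, target, groups=labels, cv=logo)
scores2 = cross_val_score(lr, data, target, cv=cv_by_hand)
assert np.allclose(scores, scores2)
```

When the folds are given as an explicit list, no groups argument is needed, because the index arrays already say which samples belong to each fold.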

I know this question is quite old, but I had the same problem. Looks like there will soon be a contribution that lets you do this:

https://github.com/scikit-learn/scikit-learn/pull/4583
