How to split/partition a dataset into training and test datasets for, e.g., cross validation?

醉话见心 2020-11-27 10:42

What is a good way to randomly split a NumPy array into training and testing/validation datasets? Something similar to the cvpartition or crossvalind functions in MATLAB?

12 Answers
  •  清酒与你
    2020-11-27 11:36

    You may also consider a stratified division into training and testing sets. A stratified division also assigns samples to the training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.

    import numpy as np

    def get_train_test_inds(y, train_proportion=0.7):
        '''Generate boolean indices for a random stratified split into
        training and testing sets with proportions train_proportion and
        (1 - train_proportion) of the initial sample.
        y is any iterable indicating the class of each observation.
        The original class proportions are preserved in both sets
        (stratified sampling).
        '''
        y = np.array(y)
        train_inds = np.zeros(len(y), dtype=bool)
        test_inds = np.zeros(len(y), dtype=bool)
        # Split each class separately so its proportion is preserved
        for value in np.unique(y):
            value_inds = np.nonzero(y == value)[0]
            np.random.shuffle(value_inds)
            n = int(train_proportion * len(value_inds))
            train_inds[value_inds[:n]] = True
            test_inds[value_inds[n:]] = True
        return train_inds, test_inds

    y = np.array([1, 1, 2, 2, 3, 3])
    train_inds, test_inds = get_train_test_inds(y, train_proportion=0.5)
    print(y[train_inds])
    print(y[test_inds])
    

    This code outputs:

    [1 2 3]
    [1 2 3]
    
