How to split data into trainset and testset randomly?

花落未央 2020-12-07 16:27

I have a large dataset and want to split it into a training set (50%) and a testing set (50%).

Say I have 100 examples stored in the input file, one example per line.

9 answers
  •  -上瘾入骨i
    2020-12-07 17:26

    The following produces more general k-fold cross-validation splits. You get your 50-50 partitioning by setting k=2 below; all you would have to do is pick one of the two partitions produced. Note: I haven't tested the code, but I'm pretty sure it should work.

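    import math
    import random

    def k_fold(myfile, myseed=11109, k=3):
        # Load data: one example per line
        with open(myfile) as f:
            data = f.readlines()

        # Shuffle input reproducibly (random.seed is a function and must be called)
        random.seed(myseed)
        random.shuffle(data)

        # Compute partition size given input k
        len_part = int(math.ceil(len(data) / float(k)))

        # Create one partition per fold
        train = {}
        test = {}
        for ii in range(k):
            start = ii * len_part
            end = start + len_part
            test[ii] = data[start:end]
            # Slice around the test block rather than filtering by value,
            # so duplicate lines are not dropped from the training set
            train[ii] = data[:start] + data[end:]

        return train, test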
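
If all you need is the single 50-50 split from the question, a shorter route may be enough. The following is only a minimal sketch: the helper name `split_half` is made up for illustration, the seed value is simply reused from the code above, and the 100 in-memory lines stand in for the contents of the input file.

```python
import random

def split_half(lines, seed=11109):
    """Shuffle a copy of the lines and cut it in the middle."""
    shuffled = list(lines)                 # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # private generator, reproducible
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# Stand-in for the 100 lines read from the input file
examples = ["example %d\n" % i for i in range(100)]
train, test = split_half(examples)
print(len(train), len(test))  # prints: 50 50
```

With an odd number of lines the extra example goes to the test half. Using a separate `random.Random(seed)` instance keeps the split reproducible without reseeding the global generator.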
    
