I have a large dataset and want to split it into a training set (50%) and a testing set (50%).
Say I have 100 examples stored in the input file; each line contains one example.
The following produces more general k-fold cross-validation splits. Your 50/50 partitioning would be achieved by setting k=2
below; all you would have to do is pick one of the two partitions produced. Note: I haven't tested the code, but I'm pretty sure it should work.
import random, math

def k_fold(myfile, myseed=11109, k=3):
    # Load data: one example per line
    data = open(myfile).readlines()
    # Shuffle input reproducibly
    random.seed(myseed)
    random.shuffle(data)
    # Compute partition size given input k
    len_part = int(math.ceil(len(data) / float(k)))
    # Create one train/test split per fold
    train = {}
    test = {}
    for ii in range(k):
        test[ii] = data[ii * len_part:(ii + 1) * len_part]
        # Training set is everything outside the test slice; slicing by index
        # keeps duplicate lines and avoids a slow membership test
        train[ii] = data[:ii * len_part] + data[(ii + 1) * len_part:]
    return train, test
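
For instance, to get the 50/50 split from the question, you could call it with k=2 and keep just the first fold. A minimal sketch (the file name is only a placeholder):

    # 100 lines, k=2 -> each fold holds 50 examples
    train, test = k_fold("examples.txt", k=2)
    train_set = train[0]   # 50% of the shuffled lines
    test_set = test[0]     # the remaining 50%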