I have a large dataset and want to split it into a training set (50%) and a testing set (50%).
Say I have 100 examples stored in the input file; each line contains one example.
The following produces more general k-fold cross-validation splits. Your 50/50 partitioning would be achieved by setting k=2
below; all you would have to do is pick one of the two partitions produced. Note: I haven't tested the code, but I'm pretty sure it should work.
import random, math

def k_fold(myfile, myseed=11109, k=3):
    # Load data: one example per line
    data = open(myfile).readlines()
    # Shuffle input reproducibly
    random.seed(myseed)
    random.shuffle(data)
    # Compute partition size given input k
    len_part = int(math.ceil(len(data) / float(k)))
    # Create one train/test split per fold
    train = {}
    test = {}
    for ii in range(k):
        test[ii] = data[ii * len_part:(ii + 1) * len_part]
        # Training set is everything outside the test slice; slicing by index
        # keeps duplicate lines and avoids a slow membership test
        train[ii] = data[:ii * len_part] + data[(ii + 1) * len_part:]
    return train, test
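
For instance, to get the 50/50 split from the question, you could call it with k=2 and keep just the first fold. A minimal sketch (the file name is only a placeholder):

    # 100 lines, k=2 -> each fold holds 50 examples
    train, test = k_fold("examples.txt", k=2)
    train_set = train[0]   # 50% of the shuffled lines
    test_set = test[0]     # the remaining 50%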