Question
I have a problem that I suspect has a simple solution. I'm building a model and want to test its accuracy with 10-fold cross-validation. To do this I have to split my training corpus 90%/10% into training and test sections, then train my model on the 90% and test it on the 10%. I want to do this ten times, taking a different 90%/10% split each time, so that eventually every part of the corpus has been used as test data. Then I'll average the results over the ten tests.
I have tried to write a script that extracts 10% of the training corpus and writes it to a new file, but so far I haven't got it working. What I have done is count the total number of lines in the file and divide that number by ten to get the size of each of the ten test sets I want to extract.
trainFile = open("danish.train")
numberOfLines = 0
for line in trainFile:
    numberOfLines += 1
# integer division, so the result is a whole number on Python 2 and 3
lengthTest = numberOfLines // 10
My own training file has 3638 lines, so each test set should consist of roughly 363 lines.
How do I write lines 1-363, lines 364-726, etc. to different test files?
Answer 1:
Once you have the count of lines, go back to the beginning of the file, and start copying out lines to danish.train.part-01. When the line number is a multiple of the size of the 10% test set, open a new file for the next part.
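#!/usr/bin/env python2.7
trainFile = open("danish.train")
numberOfLines = 0
for line in trainFile:
    numberOfLines += 1
# integer division, so the result is a whole number on Python 2 and 3
lengthTest = numberOfLines // 10
# rewind file to beginning
trainFile.seek(0)
numberOfLines = 0
file_number = 0
output = None
for line in trainFile:
    # every lengthTest lines, close the current part and open the next one
    # (if the line count is not a multiple of 10, the leftover lines
    # end up in an extra part-11 file)
    if numberOfLines % lengthTest == 0:
        if output is not None:
            output.close()
        file_number += 1
        output = open('danish.train.part-%02d' % file_number, 'w')
    numberOfLines += 1
    output.write(line)
if output is not None:
    output.close()
trainFile.close()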
On this input file (sorry I don’t speak Danish!):
one
two
three
four
five
six
seven
eight
nine
ten
eleven
twelve
thirteen
fourteen
fifteen
sixteen
seventeen
eighteen
nineteen
twenty
twenty-one
twenty-two
twenty-three
twenty-four
twenty-five
twenty-six
twenty-seven
twenty-eight
twenty-nine
thirty
This creates the files:
danish.train.part-01
danish.train.part-02
danish.train.part-03
danish.train.part-04
danish.train.part-05
danish.train.part-06
danish.train.part-07
danish.train.part-08
danish.train.part-09
danish.train.part-10
and part 5, for example, contains:
thirteen
fourteen
fifteen
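When the line count is not a multiple of ten (3638 lines, for instance), the modulo test above starts an eleventh file for the leftover lines. Here is a minimal sketch of one way around that, assuming the whole file fits in memory: compute each part's boundaries as i * n // k, so that exactly k files are written and their sizes differ by at most one line (363 or 364 lines each for a 3638-line file). The function name split_into_parts and its parameters are hypothetical, not something from the script above.

```python
def split_into_parts(path, k=10):
    """Write the lines of `path` to exactly k files: <path>.part-01, ...

    Hypothetical helper; assumes the file fits in memory.
    Returns the list of file names written.
    """
    with open(path) as f:
        lines = f.readlines()
    n = len(lines)
    names = []
    for i in range(k):
        # Boundaries i*n//k and (i+1)*n//k spread any remainder over
        # the parts, so part sizes differ by at most one line.
        start = i * n // k
        end = (i + 1) * n // k
        name = '%s.part-%02d' % (path, i + 1)
        with open(name, 'w') as out:
            out.writelines(lines[start:end])
        names.append(name)
    return names
```

Calling split_into_parts("danish.train") would then produce danish.train.part-01 through danish.train.part-10 and nothing more.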
Answer 2:
Untested, but here's the basic idea:
def getNthSeg(fpath, n, segSize):
    """Get the nth segment (0-based) of segSize many lines"""
    answer = []
    with open(fpath) as f:
        for i, line in enumerate(f):
            # segment n covers line indices [segSize*n, segSize*(n+1))
            if segSize*n <= i < segSize*(n+1):
                answer.append(line)
    return answer

def getFolds(fpath, k):
    """ In your case, k is 10"""
    with open(fpath) as f:
        numLines = len(f.readlines())
    # integer division; leftover lines beyond k*segSize are not in any fold
    segSize = numLines // k
    answer = []
    for n in range(k):
        fold = getNthSeg(fpath, n, segSize)
        answer.append(fold)
    return answer
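Once the folds exist, each one serves as the test set in turn while the remaining folds are concatenated into the training set, and the scores are averaged at the end. A rough sketch of that loop follows. Both names are illustrative assumptions: train_and_score is a hypothetical callback standing in for whatever trains your model and returns its accuracy on the test lines, and cross_validate is just a name chosen for this example.

```python
def cross_validate(folds, train_and_score):
    """Average the score over all folds, each used once as the test set.

    folds: list of lists of lines (for example the result of getFolds).
    train_and_score: hypothetical callable (training, testing) -> float.
    """
    scores = []
    for i, testing in enumerate(folds):
        # Training data is every fold except the i-th, flattened.
        training = [line
                    for j, fold in enumerate(folds) if j != i
                    for line in fold]
        scores.append(train_and_score(training, testing))
    return sum(scores) / float(len(scores))
```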
Answer 3:
If your file isn't huge, you can split it into 90/10 like this:
with open("danish.train") as trainFile:
    lines = list(trainFile)
N = len(lines)
# integer division, so the slice index is a whole number on Python 2 and 3
testing = lines[:N // 10]
training = lines[N // 10:]
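The slice above yields only the first of the ten splits. To rotate the test window through the whole corpus, the same slicing can be repeated with moving boundaries. A minimal sketch, assuming lines is the in-memory list built above; kfold_splits is a hypothetical name, not a library function:

```python
def kfold_splits(lines, k=10):
    """Yield (training, testing) pairs, one per fold.

    Hypothetical helper: fold i tests on lines[start:end] and trains on
    everything else, so every line is used for testing exactly once.
    """
    n = len(lines)
    for i in range(k):
        start = i * n // k
        end = (i + 1) * n // k
        testing = lines[start:end]
        training = lines[:start] + lines[end:]
        yield training, testing
```

Looping with "for training, testing in kfold_splits(lines):" then gives ten splits whose test portions never overlap.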
Source: https://stackoverflow.com/questions/14714528/how-to-extract-specific-lines-from-a-data-file