How to extract specific lines from a data file

时光毁灭记忆、已成空白 submitted on 2020-01-05 08:43:16

Question


I have a problem but I feel the solution should be quite simple. I'm building a model and want to test its accuracy by 10-fold cross-validation. To do this I have to split my training corpus 90%/10% into training and test sections, then train my model on the 90% and test on the 10%. This I want to do ten times, by taking a different 90%/10% split every time, so that eventually each bit of the corpus has been used as testing data. Then I'll average the results for each 10% test.

I have tried to write a script that extracts 10% of the training corpus and writes it to a new file, but so far I haven't gotten it working. What I have done is count the total number of lines in the file and divide that number by ten to get the size of each of the ten test sets I want to extract.

trainFile = open("danish.train")
numberOfLines = 0

for line in trainFile:
    numberOfLines += 1

lengthTest = numberOfLines // 10

I have found, for my own training file, that it consists of 3638 lines, so each test should consist roughly of 363 lines.

How do I write lines 1-363, lines 364-726, etc. to different test files?


Answer 1:


Once you have the count of lines, go back to the beginning of the file, and start copying out lines to danish.train.part-01. When the line number is a multiple of the size of the 10% test set, open a new file for the next part.

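#!/usr/bin/env python2.7

trainFile = open("danish.train")
numberOfLines = 0

# first pass: count the lines
for line in trainFile:
    numberOfLines += 1

# integer division so the part size is a whole number of lines
lengthTest = numberOfLines // 10

# rewind file to beginning
trainFile.seek(0)

numberOfLines = 0
file_number = 0
output = None
for line in trainFile:
    # start a new part every lengthTest lines
    if numberOfLines % lengthTest == 0:
        if output is not None:
            output.close()
        file_number += 1
        output = open('danish.train.part-%02d' % file_number, 'w')

    numberOfLines += 1
    output.write(line)

# close the last part and the input file
if output is not None:
    output.close()
trainFile.close()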

On this input file (sorry I don’t speak Danish!):

one
two
three
four
five
six
seven
eight
nine
ten
eleven
twelve
thirteen
fourteen
fifteen
sixteen
seventeen
eighteen
nineteen
twenty
twenty-one
twenty-two
twenty-three
twenty-four
twenty-five
twenty-six
twenty-seven
twenty-eight
twenty-nine
thirty

This creates the files

danish.train.part-01
danish.train.part-02
danish.train.part-03
danish.train.part-04
danish.train.part-05
danish.train.part-06
danish.train.part-07
danish.train.part-08
danish.train.part-09
danish.train.part-10

and part 5, for example, contains:

thirteen
fourteen
fifteen
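
One thing to watch: when the line count is not an exact multiple of ten, the loop above opens an eleventh file for the leftovers (3638 lines with lengthTest = 363 leaves 8 lines for a part-11). A minimal sketch of one way around that, assuming it is acceptable for the first few parts to be one line longer than the rest and that the file fits in memory; `split_into_parts` and its parameters are names made up for this illustration:

```python
def split_into_parts(path, k=10):
    """Write the lines of `path` to exactly k files: path.part-01, path.part-02, ..."""
    with open(path) as f:
        lines = f.readlines()

    # base size of each part, plus how many parts need one extra line
    size, extra = divmod(len(lines), k)

    names = []
    start = 0
    for part in range(k):
        # the first `extra` parts absorb the remainder, one line each
        end = start + size + (1 if part < extra else 0)
        name = '%s.part-%02d' % (path, part + 1)
        with open(name, 'w') as out:
            out.writelines(lines[start:end])
        names.append(name)
        start = end
    return names
```

Calling `split_into_parts("danish.train")` on the 3638-line file would then give eight parts of 364 lines and two of 363, with no part-11.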



Answer 2:


Untested, but here's the basic idea:

def getNthSeg(fpath, n, segSize):
    """Get the nth segment (0-based) of segSize many lines"""
    answer = []
    with open(fpath) as f:
        for i, line in enumerate(f):
            # segment n covers lines segSize*n up to, but not including, segSize*(n+1)
            if segSize*n <= i < segSize*(n+1):
                answer.append(line)
    return answer

def getFolds(fpath, k):
    """ In your case, k is 10"""
    with open(fpath) as f:
        numLines = len(f.readlines())
    # integer division; any remainder lines past k*segSize are not assigned to a fold
    segSize = numLines // k
    answer = []
    for n in range(k):
        fold = getNthSeg(fpath, n, segSize)
        answer.append(fold)
    return answer
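
To connect this back to the cross-validation loop in the question, here is a rough sketch of how a list of folds might be consumed: each fold takes one turn as the test set, the remaining folds are concatenated into the training set, and the per-fold scores are averaged. `cross_validate` and the `evaluate` callback are hypothetical names; `evaluate(training, testing)` stands in for whatever trains and scores your model and is assumed to return a number.

```python
def cross_validate(folds, evaluate):
    """Average evaluate(training, testing) over every choice of test fold."""
    scores = []
    for i, testing in enumerate(folds):
        # every fold except the i-th goes into the training set
        training = []
        for j, fold in enumerate(folds):
            if j != i:
                training.extend(fold)
        scores.append(evaluate(training, testing))
    return sum(scores) / float(len(scores))


# toy usage: three folds of two lines each, and a dummy scorer
# that just reports the size of the training set
folds = [['a\n', 'b\n'], ['c\n', 'd\n'], ['e\n', 'f\n']]
average = cross_validate(folds, lambda training, testing: len(training))
```

In real use the lambda would be replaced by a function that trains on `training` and returns the accuracy measured on `testing`.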



Answer 3:


If your file isn't huge, you can split it into 90/10 like this:

with open("danish.train") as trainFile:
    lines = list(trainFile)
N = len(lines)
# integer division so the slice index is a whole number
testing = lines[:N // 10]
training = lines[N // 10:]
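
The same slicing extends to the other nine splits. A minimal sketch, assuming the whole file fits in memory as above; `nth_split` is a name invented for this example, and any leftover lines after the last full fold simply stay in the training set:

```python
def nth_split(lines, i, k=10):
    """Return (training, testing), where testing is the i-th of k folds (0-based)."""
    size = len(lines) // k
    start, end = i * size, (i + 1) * size
    testing = lines[start:end]
    # everything before and after the test slice is used for training
    training = lines[:start] + lines[end:]
    return training, testing


# toy data standing in for list(open("danish.train"))
lines = ['line %d\n' % n for n in range(25)]
training, testing = nth_split(lines, 0)
```

Looping `i` over `range(10)` gives the ten different 90%/10% splits.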


Source: https://stackoverflow.com/questions/14714528/how-to-extract-specific-lines-from-a-data-file
