Python how to read N number of lines at a time

Asked by 感情败类, 2020-11-27 03:42

I am writing a program to read an enormous text file (several GB) N lines at a time, process that batch, and move on to the next N lines until I have completed the entire file.
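For the batching itself, a minimal sketch using `itertools.islice`, which pulls up to N lines per call without loading the whole file into memory (the file path and batch size below are placeholders):

```python
from itertools import islice
import os
import tempfile

def read_in_batches(path, n):
    """Yield successive lists of up to n lines from the file at path."""
    with open(path) as fh:
        while True:
            batch = list(islice(fh, n))  # next n lines, or fewer at EOF
            if not batch:
                break
            yield batch

# Demo: write a small 7-line file, then read it back 3 lines at a time.
with tempfile.NamedTemporaryFile('w', delete=False, suffix='.txt') as tmp:
    tmp.write(''.join('line %d\n' % i for i in range(7)))
    path = tmp.name

batches = list(read_in_batches(path, 3))
os.unlink(path)
print([len(b) for b in batches])  # [3, 3, 1]
```

Because `islice` consumes the same file iterator on every call, each batch picks up exactly where the previous one stopped, so the file is read sequentially in a single pass.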

6 Answers
  •  生来不讨喜
    2020-11-27 04:05

    Since the requirement was added that there be statistically uniform distribution of the lines selected from the file, I offer this simple approach.

    """randsamp - extract a random subset of n lines from a large file"""
    
    import random
    
    def scan_linepos(path):
        """Return a list of byte offsets of the beginning of each line."""
        linepos = []
        offset = 0
        # Binary mode so offsets are byte positions valid for seek(),
        # even when the file contains multi-byte characters; tell() is
        # unreliable during iteration due to read-ahead buffering, so
        # track the offset manually.
        with open(path, 'rb') as inf:
            for line in inf:
                linepos.append(offset)
                offset += len(line)
        return linepos
    
    def sample_lines(path, linepos, nsamp):
        """Return nsamp lines from path, where linepos holds the line offsets."""
        offsets = random.sample(linepos, nsamp)
        offsets.sort()  # sequential reads may be more efficient
    
        lines = []
        with open(path, 'rb') as inf:
            for offset in offsets:
                inf.seek(offset)
                lines.append(inf.readline().decode())
        return lines
    
    dataset = 'big_data.txt'
    nsamp = 5
    linepos = scan_linepos(dataset)  # the scan only need be done once
    
    lines = sample_lines(dataset, linepos, nsamp)
    print('selecting %d lines from a file of %d' % (nsamp, len(linepos)))
    print(''.join(lines))
    

    I tested it on a mock data file of 3 million lines comprising 1.7GB on disk. scan_linepos dominated the runtime, taking about 20 seconds on my not-so-hot desktop.

    Just to check the performance of sample_lines, I used the timeit module like so:

    import timeit
    t = timeit.Timer('sample_lines(dataset, linepos, nsamp)', 
            'from __main__ import sample_lines, dataset, linepos, nsamp')
    trials = 10 ** 4
    elapsed = t.timeit(number=trials)
    print('%dk trials in %.2f seconds, %.2fµs per trial' % (trials // 1000,
            elapsed, (elapsed / trials) * (10 ** 6)))
    

    For various values of nsamp: when nsamp was 100, a single sample_lines call completed in 460µs and scaled linearly, reaching 47ms per call at 10k samples.

    The natural next question is whether random is random enough, and the answer is "sub-cryptographic but certainly fine for bioinformatics".
