How to read tokens without reading whole line or file

Question

Is there a well-hidden way to read tokens from a file or file-like object without reading entire lines? The application I immediately have (someone else's problem, not mine) is transposing a large matrix with a few very long rows, essentially performing an itertools.izip() on iterators that pick out the elements of a single column. The idea is not to have the entire file in memory during iteration.

The rows are space-delimited ASCII decimal numbers.

The problem would be simple with Java's Scanner class, but I don't see anything in the Python Standard Library that appears to tokenize without having the whole input in a string.

For the record, I know how to write this on my own. I'm just wondering if there's a standard tool that I missed. Something FOSS/libre that can be EasyInstalled is good, too, but I don't see anything on PyPI either.

The full problem was to take the sample input:

"123 3 234234 -35434 112312 54 -439 99 0 42\n" +
"13 456 -78 910 333 -44 5555 6 8"

...and produce the output (as a generator, without reading all of the very long rows into memory at once):

[123, 13], [3, 456], [234234, -78], ...etc

As I said, it's essentially itertools.izip(iterator1, iterator2), pointing iterator1 at the start of the file, and iterator2 just past the newline to read the second row.

Answer 1:

To read tokens from a file one by one, you could use the re module to generate tokens from a memory-mapped file:

#!/usr/bin/env python3
import re
import sys
from mmap import ACCESS_READ, mmap

def generate_tokens(filename, pattern):
    # map the file read-only; the OS pages it in on demand
    with open(filename, 'rb') as f, mmap(f.fileno(), 0, access=ACCESS_READ) as mm:
        yield from re.finditer(pattern, mm)

# sum all integers (with an optional minus sign) in a file given on the command line
print(sum(int(m.group()) for m in generate_tokens(sys.argv[1], br'-?\d+')))

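It works even if the file doesn't fit in memory, because the operating system pages the mapped file in on demand. On a 32-bit Python the file still has to fit in the address space.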
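
For the transposition described in the question, the same memory-mapped approach could be extended along these lines. This is only a sketch, not part of the original answer: the names INT, row_spans, row_ints, columns and columns_from_file are made up for illustration, and it assumes rows are separated by '\n' and contain optionally signed decimal integers. Locating the row boundaries touches every page of the file once, but nothing is kept in memory beyond the offsets.

```python
import mmap
import re

INT = re.compile(br'-?\d+')  # optionally signed decimal integer


def row_spans(buf):
    """Yield (start, end) byte offsets of each non-empty row in buf."""
    start, size = 0, len(buf)
    while start < size:
        end = buf.find(b'\n', start)
        if end == -1:
            end = size  # the last row has no trailing newline
        if end > start:
            yield start, end
        start = end + 1


def row_ints(buf, start, end):
    """Lazily yield the integers found in buf[start:end]."""
    for match in INT.finditer(buf, start, end):
        yield int(match.group())


def columns(buf):
    """Lazily yield one list per column, stopping at the shortest row."""
    rows = [row_ints(buf, start, end) for start, end in row_spans(buf)]
    return map(list, zip(*rows))


def columns_from_file(path):
    """Same as columns(), but reads from a memory-mapped file."""
    with open(path, 'rb') as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        yield from columns(mm)
```

Since columns() accepts any bytes-like object, it can be tried on a literal first: list(columns(b"123 3\n13 456")) gives [[123, 13], [3, 456]].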

Answer 2:

Here is a generator that processes a file one character at a time and yields tokens when whitespace is encountered.

def generate_tokens(path):
    with open(path, 'r') as fp:
        buf = []
        while True:
            ch = fp.read(1)
            if ch == '':
                if buf:
                    yield ''.join(buf)  # flush the last token at end of file
                break
            elif ch.isspace():
                if buf:
                    yield ''.join(buf)
                    buf = []
            else:
                buf.append(ch)

if __name__ == '__main__':
    for token in generate_tokens('input.txt'):
        print(token)

To be more generic, it looks like you might be able to use the re module as described in the question linked below. Just feed the input with a generator from your file to avoid reading the whole file at once.

Python equivalent of ruby's StringScanner?
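
One way to feed a regular expression from a file without loading it whole is to read fixed-size chunks and hold back a match that touches the end of the buffer, since it may continue in the next chunk. The sketch below is an illustration under assumptions, not code from the linked question: iter_tokens and TOKEN are hypothetical names, the stream is assumed to be opened in text mode, and tokens are assumed to be runs of non-whitespace. Patterns whose matches can grow in other ways at a chunk boundary would need more care.

```python
import re

TOKEN = re.compile(r'\S+')  # a token is a run of non-whitespace characters


def iter_tokens(stream, chunk_size=64 * 1024):
    """Yield whitespace-delimited tokens from a text stream, chunk by chunk."""
    pending = ''
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        keep_from = len(pending)  # by default nothing is carried over
        for match in TOKEN.finditer(pending):
            if match.end() == len(pending):
                # the token may continue in the next chunk: hold it back
                keep_from = match.start()
                break
            yield match.group()
        pending = pending[keep_from:]
    if pending:
        # end of input: the held-back token is complete
        yield pending
```

Memory use is bounded by the chunk size plus the longest token, so chunk_size can be tuned without affecting the result.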

Answer 3:

You can read the file in chunks with file.read(size). I would not recommend reading one byte at a time, however, as that drastically hurts performance. The following snippet (lightly tested, use at your own risk) reads the file in chunks and yields the numbers of a single row. You'll have to scan through the file first to determine where each row starts, though.

def values_chunks(file_object, pos_from=0, chunk_size=32*1024):
    """Yield the integers of the row that starts at byte offset pos_from.

    file_object must be opened in binary mode ('rb').
    """
    file_object.seek(pos_from)
    tail = b''
    while True:
        raw_data = file_object.read(chunk_size)
        eof = not raw_data
        # keep only the part before the first newline, if there is one
        head, newline, _ = (tail + raw_data).partition(b'\n')
        eol = bool(newline)
        raw_values = head.split()
        if not (eol or eof) and not head[-1:].isspace():
            # the chunk ended mid-token: carry the partial token over
            tail = raw_values.pop()
        else:
            tail = b''
        for value in raw_values:
            yield int(value)
        if eol or eof:  # end of row or end of file
            break

>>> with open('test', 'w') as test:
...     print(' '.join(map(str, range(10**5))), file=test)
...     print(' '.join(map(str, range(10**4))), end='', file=test)
...
>>> with open('test', 'rb') as test:
...     values = list(values_chunks(test))
...
>>> len(values)
100000
>>> sum(values)
4999950000
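
The preliminary pass over the file mentioned above, and the pairing of rows into columns, could look roughly like this. It is a self-contained sketch under assumptions: row_offsets, row_values and transpose are illustrative names, row_values is a compact stand-in for a chunked row reader, rows are separated by '\n', and every row gets its own file handle opened in binary mode.

```python
def row_offsets(path, chunk_size=64 * 1024):
    """Scan the file once and return the byte offset where each row starts."""
    offsets = [0]
    position = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            index = chunk.find(b'\n')
            while index != -1:
                offsets.append(position + index + 1)
                index = chunk.find(b'\n', index + 1)
            position += len(chunk)
    if len(offsets) > 1 and offsets[-1] == position:
        offsets.pop()  # a trailing newline does not start a new row
    return offsets


def row_values(path, offset, chunk_size=64 * 1024):
    """Yield the integers of the row starting at offset, reading in chunks."""
    with open(path, 'rb') as f:
        f.seek(offset)
        tail = b''
        while True:
            chunk = f.read(chunk_size)
            head, newline, _ = (tail + chunk).partition(b'\n')
            done = bool(newline) or not chunk  # end of row or end of file
            tokens = head.split()
            tail = b''
            if not done and not head[-1:].isspace():
                tail = tokens.pop()  # partial token, finish it next round
            for token in tokens:
                yield int(token)
            if done:
                return


def transpose(path, chunk_size=64 * 1024):
    """Lazily yield one list per column, stopping at the shortest row."""
    rows = [row_values(path, offset, chunk_size)
            for offset in row_offsets(path, chunk_size)]
    return map(list, zip(*rows))
```

Memory use is roughly one chunk per row, which suits the stated case of a few very long rows. A file with very many rows would need a different strategy, since this opens one handle per row.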

Source: https://stackoverflow.com/questions/20019503/how-to-read-tokens-without-reading-whole-line-or-file

Tags

python

file-io