Read multiple lines from a file batch by batch

…衆ロ難τιáo~ submitted on 2020-12-29 13:18:02

Question


I would like to know whether there is a way to read multiple lines from a file, batch by batch. For example:

with open(filename, 'rb') as f:
    for n_lines in f:
        process(n_lines)

What I would like is for each iteration to read the next n lines from the file, batch by batch.

A single file is too big to load at once, so I want to read it part by part.


Answer 1:


itertools.islice combined with the two-argument form of iter can accomplish this, though it looks a little odd:

from itertools import islice

n = 5  # Or whatever chunk size you want
with open(filename, 'rb') as f:
    for n_lines in iter(lambda: tuple(islice(f, n)), ()):
        process(n_lines)

This keeps slicing n lines at a time off f (tuple forces the whole chunk to be read in) until f is exhausted, at which point islice yields nothing, the lambda returns the empty-tuple sentinel, and the loop stops. The final chunk will hold fewer than n lines if the number of lines in the file isn't an exact multiple of n. If you want each chunk as a single string instead, change the for loop to:

    # The b prefixes are ignored on 2.7, and necessary on 3.x since you opened
    # the file in binary mode
    for n_lines in iter(lambda: b''.join(islice(f, n)), b''):
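
To try the two-argument iter idiom without a real file, here is a minimal sketch assuming Python 3. The helper name read_batches and the in-memory io.BytesIO sample are made up for illustration; they are not part of the original answer.

```python
import io
from itertools import islice


def read_batches(f, n):
    """Return an iterator of tuples holding up to n lines each from f."""
    # iter(callable, sentinel) calls the lambda repeatedly and stops
    # as soon as it returns the sentinel, here the empty tuple.
    return iter(lambda: tuple(islice(f, n)), ())


# Hypothetical stand-in for a file opened in binary mode.
sample = io.BytesIO(b"a\nb\nc\nd\ne\nf\ng\n")
batches = list(read_batches(sample, 3))
print(batches)
```

With seven lines and n = 3, the last tuple holds only one line, which matches the "final chunk is shorter" behaviour described above.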

Another approach is to use izip_longest, which avoids the lambda:

from future_builtins import map  # Only on Py2
from itertools import izip_longest  # zip_longest on Py3

    # gets tuples possibly padded with empty strings at end of file
    for n_lines in izip_longest(*[f]*n, fillvalue=b''):

    # Or to combine into a single string:
    for n_lines in map(b''.join, izip_longest(*[f]*n, fillvalue=b'')):
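
For Python 3 only, the same grouping trick might look like the sketch below. It assumes zip_longest (the Python 3 name) and an in-memory sample in place of a real file; the variable names padded and trimmed are illustrative. It also shows one possible way to drop the padding from the last batch, which the original answer does not do.

```python
import io
from itertools import zip_longest

n = 3
# Hypothetical stand-in for a file opened in binary mode.
sample = io.BytesIO(b"a\nb\nc\nd\n")

# [sample] * n repeats the SAME iterator n times, so each output tuple
# pulls n consecutive lines from it; missing lines become the fill value.
padded = list(zip_longest(*[sample] * n, fillvalue=b""))

# Optional: strip the padding. A real line is never empty (even a blank
# line is b"\n"), so filtering on truthiness only removes fill values.
trimmed = [tuple(line for line in batch if line) for batch in padded]
print(padded)
print(trimmed)
```

Unlike the islice version, every tuple here has exactly n entries until you trim it, which can be handy if process expects fixed-size input.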



Answer 2:


You can iterate directly over the lines of a file (see the file.next docs; this also works on Python 3), like this:

with open(filename) as f:
    for line in f:
        something(line)

so your code can be rewritten as:

n = 5  # your batch size
with open(filename) as f:
    batch = []
    for line in f:
        batch.append(line)
        if len(batch) == n:
            process(batch)
            batch = []
    if batch:  # the last batch may be smaller than n; skip it if empty
        process(batch)
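
If you would rather keep the batching logic separate from the processing, the accumulate-and-flush loop can be wrapped in a generator. This is only a sketch: the name batched_lines is made up here, and io.StringIO stands in for a real text file.

```python
import io


def batched_lines(f, n):
    """Yield lists of up to n lines from the iterable f."""
    batch = []
    for line in f:
        batch.append(line)
        if len(batch) == n:
            yield batch
            batch = []
    if batch:  # emit the final, shorter batch only if it is non-empty
        yield batch


# Hypothetical stand-in for a file opened in text mode.
sample = io.StringIO("1\n2\n3\n4\n5\n")
result = list(batched_lines(sample, 2))
print(result)
```

The caller then writes a plain for loop over batched_lines(f, n) and never sees an empty batch.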

but normally, processing line by line (the first example) is more convenient.

If you don't care exactly how many lines are read per batch, only that each batch doesn't use too much memory, use file.readlines with a size hint, like this:

size_hint = 1 << 24  # 16 MiB
with open(filename) as f:
    while True:
        lines = f.readlines(size_hint)
        if not lines:  # an empty list means end of file
            break
        process(lines)
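
To see how readlines behaves with a size hint, here is a small self-contained sketch. The helper name read_chunks, the tiny 64-byte hint, and the generated sample data are all assumptions chosen so the effect is visible without a large file.

```python
import io


def read_chunks(f, size_hint):
    """Yield lists of whole lines, each list totalling roughly size_hint bytes."""
    while True:
        lines = f.readlines(size_hint)
        if not lines:  # readlines returns an empty list at end of file
            break
        yield lines


# Hypothetical sample: 100 short lines held in memory.
data = b"".join(b"line %d\n" % i for i in range(100))
chunks = list(read_chunks(io.BytesIO(data), 64))
print(len(chunks), [len(chunk) for chunk in chunks[:3]])
```

The number of lines per chunk can vary, since the hint limits the approximate size rather than the line count, but lines are never split and nothing is lost.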


Source: https://stackoverflow.com/questions/39549426/read-multiple-lines-from-a-file-batch-by-batch
