file.tell() inconsistency

Submitted by 我的未来我决定 on 2019-11-26 03:57:39

Question


Does anybody happen to know why when you iterate over a file this way:

Input:

f = open('test.txt', 'r')
for line in f:
    print "f.tell(): ", f.tell()

Output:

f.tell(): 8192
f.tell(): 8192
f.tell(): 8192
f.tell(): 8192

I consistently get the wrong file position from tell(). However, if I use readline(), tell() returns the expected position:

Input:

f = open('test.txt', 'r')
while True:
    line = f.readline()
    if line == '':
        break
    print "f.tell(): ", f.tell()

Output:

f.tell(): 103
f.tell(): 107
f.tell(): 115
f.tell(): 124

I'm running Python 2.7.1, BTW.


Answer 1:


Iterating over an open file uses a read-ahead buffer to increase efficiency. As a result, the file pointer advances in large steps across the file as you loop over the lines.

From the File Objects documentation:

In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer. As a consequence of using a read-ahead buffer, combining next() with other file methods (like readline()) does not work right. However, using seek() to reposition the file to an absolute position will flush the read-ahead buffer.

If you need to rely on .tell(), don't use the file object as an iterator. You can turn .readline() into an iterator instead (at the price of some performance loss):

for line in iter(f.readline, ''):
    print f.tell()

This uses the iter() function sentinel argument to turn any callable into an iterator.
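
As a rough sketch of the same idea in a self-contained form, the function below records the byte offset at which each line starts. It is written for Python 3 (the question uses Python 2.7), so it opens the file in binary mode and uses b'' as the sentinel. The name line_offsets and the demo file contents are made up for illustration and are not part of the original answer.

```python
import os
import tempfile


def line_offsets(path):
    """Return (start_offset, line) pairs for every line in the file at path."""
    pairs = []
    # Binary mode, so tell() returns plain byte offsets.
    with open(path, 'rb') as f:
        start = f.tell()
        # iter(callable, sentinel) keeps calling f.readline() until it returns b''.
        for line in iter(f.readline, b''):
            pairs.append((start, line))
            start = f.tell()  # offset where the next line begins
    return pairs


if __name__ == '__main__':
    # Throwaway demo file; its contents are hypothetical.
    with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
        tmp.write(b'first\nsecond\nthird\n')
    try:
        for offset, line in line_offsets(tmp.name):
            print(offset, line)
    finally:
        os.remove(tmp.name)
```

Because every line comes from readline() rather than from the file's own iterator, tell() is called between reads and reflects the end of the line just returned.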




Answer 2:


The answer lies in the following part of Python 2.7 source code (fileobject.c):

#define READAHEAD_BUFSIZE 8192

static PyObject *
file_iternext(PyFileObject *f)
{
    PyStringObject* l;

    if (f->f_fp == NULL)
        return err_closed();
    if (!f->readable)
        return err_mode("reading");

    l = readahead_get_line_skip(f, 0, READAHEAD_BUFSIZE);
    if (l == NULL || PyString_GET_SIZE(l) == 0) {
        Py_XDECREF(l);
        return NULL;
    }
    return (PyObject *)l;
}

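As you can see, the file's iterator interface reads the file in blocks of 8 KB (8192 bytes). This explains why f.tell() behaves the way it does: it reports the position of the underlying file pointer, which has already advanced to the end of the block that was read ahead, not the end of the line just returned.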

The documentation suggests it's done for performance reasons (and does not guarantee any particular size of the readahead buffer).
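
To make the effect concrete, here is a simplified Python model of a read-ahead line iterator. It is only an illustration of the mechanism, not a translation of CPython's readahead_get_line_skip; the function name iter_lines_with_readahead and the sample data are assumptions made for this sketch.

```python
import io

READAHEAD_BUFSIZE = 8192  # same value as the #define quoted above


def iter_lines_with_readahead(raw, bufsize=READAHEAD_BUFSIZE):
    """Yield (line, raw.tell()) pairs while reading raw in bufsize-byte blocks."""
    pending = b''
    while True:
        block = raw.read(bufsize)
        if not block:
            break
        pending += block
        pieces = pending.split(b'\n')
        pending = pieces.pop()  # incomplete tail, finished by the next block
        for piece in pieces:
            # raw.tell() is already at the end of the block, not of this line.
            yield piece + b'\n', raw.tell()
    if pending:
        # Final line without a trailing newline.
        yield pending, raw.tell()


if __name__ == '__main__':
    data = b'line\n' * 4000  # 20000 bytes of made-up sample data
    positions = [pos for _, pos in iter_lines_with_readahead(io.BytesIO(data))]
    print(positions[:4])
    print(sorted(set(positions)))
```

In this model the reported position only changes when a new block is fetched, which matches the repeated 8192 values in the question's output.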




Answer 3:


I experienced the same read-ahead buffer issue and solved it using Martijn's suggestion.

I've since generalized my solution for anyone else looking to do such things:

https://github.com/loisaidasam/csv-position-reader

Happy CSV parsing!



Source: https://stackoverflow.com/questions/14145082/file-tell-inconsistency
