Question
Inspired by this question, I'm wondering exactly what the optional buffering argument to Python's open() function does. From looking at the source, I see that buffering is passed into setvbuf to set the buffer size for the stream (and that it does nothing on a system without setvbuf, which the docs confirm).
However, when you iterate over a file, there is a constant called READAHEAD_BUFSIZE that appears to define how much data is read at a time (this constant is defined here).
My question is exactly how the buffering argument relates to READAHEAD_BUFSIZE. When I iterate through a file, which one defines how much data is being read off disk at a time? And is there a place in the C source that makes this clear?
Answer 1:
READAHEAD_BUFSIZE is only used when you use the file as an iterator:
for line in fileobj:
    print line
It is a separate buffer from the one controlled by the buffering argument, which is handled by the fread C API calls. Both are used when iterating.
From file.next():
In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer. As a consequence of using a read-ahead buffer, combining next() with other file methods (like readline()) does not work right. However, using seek() to reposition the file to an absolute position will flush the read-ahead buffer.
The buffer size set through setvbuf is not changed; that call is made when the file is opened and is not touched by the file iteration code. Instead, calls to Py_UniversalNewlineFread (which uses fread) fill the read-ahead buffer, creating a second buffer internal to Python. Python otherwise leaves the regular buffering up to the C API calls (fread() calls are buffered; fread() consults the userspace buffer to satisfy the request, so Python doesn't have to do anything about that).
readahead_get_line_skip() then serves newline-terminated lines from this buffer. If the buffer no longer contains a newline, it refills the buffer by calling itself recursively with a buffer size 1.25 times the previous value. This means that file iteration can read the whole rest of the file into the memory buffer if there are no more newline characters in the file!
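The refill behaviour described above can be sketched as a small model. This is not CPython's code: ReadAheadModel, next_line and trace are hypothetical names, io.BytesIO stands in for the C stream, and READAHEAD_BUFSIZE is assumed to be 8192. The growth step (a quarter of the previous size) follows the 1.25 factor mentioned above.

```python
import io

READAHEAD_BUFSIZE = 8192  # assumption: the value of the constant in the C source


class ReadAheadModel(object):
    """Toy model of the read-ahead buffer (hypothetical, not CPython's code)."""

    def __init__(self, raw):
        self.raw = raw   # underlying binary stream, stands in for the FILE*
        self.buf = b""   # unread part of the read-ahead buffer

    def next_line(self, bufsize=READAHEAD_BUFSIZE):
        # Fill the buffer only when it is empty.
        if not self.buf:
            self.buf = self.raw.read(bufsize)
            if not self.buf:
                return b""  # end of file
        idx = self.buf.find(b"\n")
        if idx >= 0:
            # Serve one newline-terminated line from the buffer.
            line, self.buf = self.buf[:idx + 1], self.buf[idx + 1:]
            return line
        # No newline left: keep the partial line, then refill with a buffer
        # 1.25 times the previous size and continue from there.
        head, self.buf = self.buf, b""
        return head + self.next_line(bufsize + bufsize // 4)


def trace(data):
    """Return the lines served and the distinct stream offsets seen."""
    raw = io.BytesIO(data)
    model = ReadAheadModel(raw)
    lines, positions = [], []
    while True:
        line = model.next_line()
        if not line:
            break
        lines.append(line)
        if not positions or positions[-1] != raw.tell():
            positions.append(raw.tell())
    return lines, positions


# 300 lines of 100 bytes each, so lines straddle the buffer boundaries.
sample = (b"x" * 99 + b"\n") * 300
print(trace(sample)[1])  # [8192, 18432, 28672, 30000]
```

In this model the stream offset first jumps by 8192, then by 10240 whenever a line straddles the end of the buffer, and the last refill is cut short by the end of the data.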
To see how much the buffer reads, print the file position (using fileobj.tell()) while looping:
>>> with open('test.txt') as f:
...     for line in f:
...         print f.tell()
...
8192 # 1 times the buffer size
8192
8192
~ lines elided
18432 # + 1.25 times the buffer size
18432
18432
~ lines elided
26624 # + 1 times the buffer size; the last newline must've aligned on the buffer boundary
26624
26624
~ lines elided
36864 # + 1.25 times the buffer size
36864
36864
etc.
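The offsets in that session are consistent with refills of either one buffer or 1.25 times one buffer. A quick arithmetic check, assuming READAHEAD_BUFSIZE is 8192 (inferred from the first offset printed, not read from the source); grown is a hypothetical helper name:

```python
READAHEAD_BUFSIZE = 8192  # assumption: inferred from the first offset printed


def grown(bufsize):
    """Size of the retry buffer: the previous size plus a quarter of it."""
    return bufsize + bufsize // 4


# Increments between consecutive offsets in the session:
# 1x, 1.25x, 1x (newline aligned with the buffer end), 1.25x.
steps = [
    READAHEAD_BUFSIZE,
    grown(READAHEAD_BUFSIZE),
    READAHEAD_BUFSIZE,
    grown(READAHEAD_BUFSIZE),
]

offsets = []
pos = 0
for step in steps:
    pos += step
    offsets.append(pos)

print(offsets)  # [8192, 18432, 26624, 36864]
```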
Which bytes are actually read from the disk (provided fileobj is an actual physical file on your disk) depends not only on the interplay between the fread() buffer and the internal read-ahead buffer, but also on whether the OS itself is buffering. Even if the file buffer is exhausted, the OS may well serve the read system call from its own cache instead of going to the physical disk.
Answer 2:
After digging through the source a bit more and trying to better understand how setvbuf and fread work, I think I understand how buffering and READAHEAD_BUFSIZE relate to each other: when iterating through a file, a buffer of READAHEAD_BUFSIZE is filled on each line, but filling this buffer uses calls to fread, each of which fills a buffer of buffering bytes.
Python's read is implemented as file_read, which calls Py_UniversalNewlineFread, passing it the number of bytes to read as n. Py_UniversalNewlineFread then eventually calls fread to read n bytes.
When you iterate over a file, the function readahead_get_line_skip is what retrieves a line. This function also calls Py_UniversalNewlineFread, passing n = READAHEAD_BUFSIZE. So this eventually becomes a call to fread for READAHEAD_BUFSIZE bytes.
So now the question is how many bytes fread actually reads from disk. If I run the following code in C, then 1024 bytes get copied into buf and 512 into buf2. (This might be obvious, but never having used setvbuf before, it was a useful experiment for me.)
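FILE *f = fopen("test.txt", "r");
void *buf = malloc(1024);        /* stream buffer handed to setvbuf */
void *buf2 = malloc(512);        /* destination for fread */
setvbuf(f, buf, _IOFBF, 1024);   /* fully buffered, 1024-byte buffer */
fread(buf2, 512, 1, f);          /* request one 512-byte item */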
So, finally, this suggests to me that when iterating over a file, at least READAHEAD_BUFSIZE bytes are read from disk, but it might be more. I think that the first iteration of for line in f will read x bytes, where x is the smallest multiple of buffering that is greater than READAHEAD_BUFSIZE.
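Taken literally, that guess is simple arithmetic. The sketch below only restates the guess; it does not confirm how fread behaves on any particular C library. hypothesised_first_read is a hypothetical name, READAHEAD_BUFSIZE is assumed to be 8192, and "greater than" is read as "at least", which is an interpretation.

```python
def hypothesised_first_read(buffering, readahead=8192):
    """Smallest multiple of `buffering` that is at least `readahead`.

    This encodes the guess stated above, not measured behaviour.
    """
    if buffering <= 0:
        raise ValueError("buffering must be positive for this estimate")
    blocks = -(-readahead // buffering)  # ceiling division
    return blocks * buffering


print(hypothesised_first_read(1024))   # 8192
print(hypothesised_first_read(3000))   # 9000
print(hypothesised_first_read(16384))  # 16384
```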
If anyone can confirm that this is what's actually going on, that would be great!
Source: https://stackoverflow.com/questions/15991702/what-is-the-difference-between-the-buffering-argument-to-open-and-the-hardcode