About the speed of random file read (Python)


What is behind and responsible for this speed-up?

It could be the operating system's disk cache. http://en.wikipedia.org/wiki/Page_cache

Once you've read a chunk of a file from disk, it will hang around in RAM for a while. RAM is orders of magnitude faster than disk, so you'll see a lot of variability in the time it takes to read random pieces of a large file.
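As a rough illustration, timing the same read twice usually makes the cache effect visible. This is a minimal sketch; the file name, offset and chunk size are made-up values:

```python
import time

PATH = "big_file.bin"          # hypothetical large file
OFFSET = 123 * 1024 * 1024     # arbitrary offset into the file
CHUNK = 1024 * 1024            # read 1 MiB

def timed_read():
    start = time.perf_counter()
    with open(PATH, "rb") as f:
        f.seek(OFFSET)
        f.read(CHUNK)
    return time.perf_counter() - start

print("first read :", timed_read())   # likely served from disk (cold)
print("second read:", timed_read())   # likely served from the page cache (warm)
```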

Or, depending on what "db" is, the database implementation could be doing its own caching.

Is there any way to control it?

If it's the disk cache:

It depends on the operating system, but it's typically a pretty coarse-grained control; for example, you may be forced to disable caching for an entire volume, which would affect every other process and every other file using that volume. It would also probably require root/admin access.

See this similar question about disabling caching on Linux: Linux : Disabling File cache for a process?
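One per-file option discussed there is opening the file with O_DIRECT, which bypasses the page cache for that file descriptor. A minimal sketch (assuming Linux, a filesystem that supports O_DIRECT, and a hypothetical file name); note that O_DIRECT requires block-aligned offsets, lengths and buffers:

```python
import mmap
import os

PATH = "big_file.bin"   # hypothetical file
BLOCK = 4096            # O_DIRECT needs block-aligned offsets, lengths and buffers

# os.O_DIRECT is Linux-specific, and not every filesystem supports it.
fd = os.open(PATH, os.O_RDONLY | os.O_DIRECT)
try:
    buf = mmap.mmap(-1, BLOCK)      # anonymous mmap gives a page-aligned buffer
    nread = os.readv(fd, [buf])     # read one aligned block, bypassing the page cache
    data = bytes(buf[:nread])
finally:
    os.close(fd)
```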

Depending on what you're trying to do, you can force-flush the disk cache. This can be useful in situations where you want to run a test with a cold cache, letting you get an idea of the worst-case performance. (This also depends on your OS and may require root/admin access.)
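For example, on Linux you can evict a single file's pages from the page cache with posix_fadvise, which doesn't need root; flushing the entire cache does. A sketch, assuming Python 3 on Linux and a hypothetical file name:

```python
import os

PATH = "big_file.bin"   # hypothetical file whose cached pages we want to evict

fd = os.open(PATH, os.O_RDONLY)
try:
    os.fsync(fd)                                         # flush any dirty pages first
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)   # length 0 = the whole file
finally:
    os.close(fd)

# Flushing the *entire* page cache needs root, e.g. from a shell:
#   sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
```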

If it's the database:

Depends on the database. If it's a local database, you may just be seeing disk cache effects, or the database library could be doing its own caching. If you're talking to a remote database, the caching could be happening locally or remotely (or both).

There may be configuration options to disable or control caching at either of these layers.
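For instance, if "db" happens to be SQLite (an assumption; the question doesn't say), the library keeps its own page cache that you can inspect and shrink:

```python
import sqlite3

conn = sqlite3.connect("example.db")   # hypothetical database file
# Current page-cache size (negative values are KiB, positive values are pages).
print(conn.execute("PRAGMA cache_size").fetchone())
# Shrink SQLite's own page cache so timings reflect disk/OS behaviour more directly.
conn.execute("PRAGMA cache_size = 0")
```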
