Efficiently reading a few lines from a very large binary file

Here is a simple example to illustrate my problem: I have a large binary file with 10 million values.

I want to get 5K values from certain points in this file.

I have a list of indexes giving the exact position of each value in the file.

To solve this I tried two methods:

1. Going through the values and simply using seek() (from the start of the file) to get each value, something like this:
```
binaryFile_new = open(binary_folder_path, "r+b")
wanted_line_list = []
for index in index_list:
    # Absolute seek from the start of the file, then read one fixed-size value.
    binaryFile_new.seek(size * (index), 0)
    wanted_line = binaryFile_new.read(size)
    wanted_line_list.append(wanted_line)
binaryFile_new.close()
```
But as I understand this solution reads through from the beginning for each index, therefore the complexity is O(N**2) in terms of file size.

2. Sorting the indexes so I could go through the file "once" while seeking from the current position, with something like this:
```
binaryFile_new = open(binary_folder_path, "r+b")
sorted_index_list = sorted(index_list)
wanted_line_list = []
for i, index in enumerate(sorted_index_list):
    if i == 0:
        # First value: absolute seek from the start of the file.
        binaryFile_new.seek(size * (index), 0)
    else:
        # Later values: skip forward from the end of the previously read value.
        binaryFile_new.seek((index - sorted_index_list[i-1]) * size - size, 1)
    wanted_line = binaryFile_new.read(size)
    wanted_line_list.append(wanted_line)
binaryFile_new.close()
```

I expected the second solution to be much faster because, in theory, it would go through the whole file only once, i.e. O(N).

But for some reason both solutions take about the same time.

I also have a hard constraint on memory usage, as I run this operation in parallel and on many files, so I can't read files into memory.

Maybe the mmap package will help? Though, I think mmap also scans the entire file until it gets to the index so it's not "true" random access.

I'd go with #1:

```
for index in index_list:
    binary_file.seek(size * index)
    # ...
```

(I cleaned up your code a bit to comply with Python naming conventions and to avoid the magic 0 constant, as SEEK_SET is the default anyway.)
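For completeness, here is a minimal sketch of what the whole loop might look like as a function. The name `read_records`, the temporary demo file, and the little-endian 4-byte layout are assumptions made up for illustration; the only thing taken from the question is that every value has the same fixed `size`.

```python
import os
import tempfile


def read_records(path, index_list, size):
    """Return the `size`-byte record at each index, in the order given."""
    records = []
    with open(path, "rb") as binary_file:
        for index in index_list:
            # Absolute offset from the start of the file (SEEK_SET is the default).
            binary_file.seek(size * index)
            records.append(binary_file.read(size))
    return records


# Hypothetical demo data: 1,000 records of 4 bytes, each storing its own index.
size = 4
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for i in range(1000):
        tmp.write(i.to_bytes(size, "little"))
    demo_path = tmp.name

records = read_records(demo_path, [999, 0, 42], size)
values = [int.from_bytes(record, "little") for record in records]
os.remove(demo_path)
```

Only one record is held per read, so memory use stays proportional to the number of values requested, not to the file size.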

> as I understand this solution reads through from the beginning for each index, therefore the complexity is O(N**2) in terms of file size.

No, seek() does not "read through from the beginning"; that would defeat the point of seeking. Seeking to the beginning of the file and seeking to the end of the file have roughly the same cost.

> Sorting the indexes so I could go through the file "once" while seeking from the current position

I can't quickly find a reference for this, but I believe there's absolutely no point in calculating the relative offset in order to use SEEK_CUR instead of SEEK_SET.
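If you want to convince yourself that the two calls at least end up in the same place, a small check like the following might help. It is only a sketch: the record size of 8 bytes and the record numbers 10 and 70 are arbitrary values chosen for the demo, and it says nothing about timing, only that the relative arithmetic from the question lands on the same offset as the plain absolute seek.

```python
import os
import tempfile

size = 8  # assumed record size in bytes, chosen for the demo

with tempfile.TemporaryFile() as f:
    f.write(bytes(size * 100))  # 100 zero-filled records

    # Absolute: jump straight to record 70.
    f.seek(size * 70, os.SEEK_SET)
    absolute_pos = f.tell()

    # Relative: read record 10, then skip forward to record 70
    # using the same arithmetic as in the question.
    f.seek(size * 10, os.SEEK_SET)
    f.read(size)
    f.seek((70 - 10) * size - size, os.SEEK_CUR)
    relative_pos = f.tell()
```

Both variables hold the same byte offset afterwards, so the extra bookkeeping buys nothing in terms of where the next read happens.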

There might be a small improvement just from seeking to the positions you need in order instead of randomly, as there's an increased chance your random reads will be serviced from cache, in case many of the points you need to read happen to be close to each other (and so your read patterns trigger read-ahead in the file system).
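If you do want to try reading in file order, you probably still want the results back in the order of your original index list. A rough sketch of one way to do that is below; `read_records_sorted` and the demo file are hypothetical names for illustration, and whether this is measurably faster depends on how clustered your indexes are.

```python
import os
import tempfile


def read_records_sorted(path, index_list, size):
    """Read records in ascending file order, return them in the caller's order."""
    records = [None] * len(index_list)
    # Positions within index_list, ordered by increasing file offset.
    order = sorted(range(len(index_list)), key=lambda pos: index_list[pos])
    with open(path, "rb") as binary_file:
        for pos in order:
            binary_file.seek(size * index_list[pos])
            records[pos] = binary_file.read(size)
    return records


# Hypothetical demo data: 1,000 records of 4 bytes, each storing its own index.
size = 4
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for i in range(1000):
        tmp.write(i.to_bytes(size, "little"))
    demo_path = tmp.name

records = read_records_sorted(demo_path, [999, 0, 42], size)
values = [int.from_bytes(record, "little") for record in records]
os.remove(demo_path)
```

The seeks are still absolute; only the order in which they are issued changes.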

> Maybe the mmap package will help? Though, I think mmap also scans the entire file until it gets to the index so it's not "true" random access.

mmap doesn't scan the file. It sets up a region in your program's virtual memory to correspond to the file, so that accessing any page from this region the first time leads to a page fault, during which the OS reads that page (several KB) from the file (assuming it's not in the page cache) before letting your program proceed.
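For reference, a minimal sketch of what the mmap variant could look like with the standard library `mmap` module. The function name `read_records_mmap` and the demo file are made up for illustration. Mapping the file does not load it; only the pages touched by the slices are read in, as described above.

```python
import mmap
import os
import tempfile


def read_records_mmap(path, index_list, size):
    """Return the `size`-byte record at each index via a read-only memory map."""
    with open(path, "rb") as binary_file:
        # Length 0 maps the whole file; the file must not be empty.
        with mmap.mmap(binary_file.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Slicing a mmap object copies just those bytes into a bytes object.
            return [mapped[size * i : size * (i + 1)] for i in index_list]


# Hypothetical demo data: 1,000 records of 4 bytes, each storing its own index.
size = 4
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for i in range(1000):
        tmp.write(i.to_bytes(size, "little"))
    demo_path = tmp.name

records = read_records_mmap(demo_path, [999, 0, 42], size)
values = [int.from_bytes(record, "little") for record in records]
os.remove(demo_path)
```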

The internet is full of discussions of the relative merits of read vs mmap, but I recommend you don't bother trying to optimize by using mmap, and instead use that time to learn about virtual memory and the page cache.

[edit] Reading in chunks larger than the size of your values might save you a bit of CPU time if many of the values you need are in the same chunk (which is not a given), but unless your program is CPU-bound in production, I wouldn't bother with that either.
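If you want to experiment with that anyway, here is a rough sketch of the idea: group the wanted indexes by the chunk they fall into, read each needed chunk once, and slice the values out of it. The function name `read_records_chunked`, the `records_per_chunk` parameter, and the demo file are assumptions for illustration, not something from your code. Only one chunk is held in memory at a time.

```python
import os
import tempfile
from collections import defaultdict


def read_records_chunked(path, index_list, size, records_per_chunk=1024):
    """Read each needed chunk once and slice the wanted records out of it."""
    chunk_bytes = size * records_per_chunk
    # Map chunk number -> positions in index_list whose record lives in that chunk.
    by_chunk = defaultdict(list)
    for pos, index in enumerate(index_list):
        by_chunk[index // records_per_chunk].append(pos)

    records = [None] * len(index_list)
    with open(path, "rb") as binary_file:
        for chunk_number in sorted(by_chunk):
            binary_file.seek(chunk_number * chunk_bytes)
            chunk = binary_file.read(chunk_bytes)
            for pos in by_chunk[chunk_number]:
                offset = (index_list[pos] % records_per_chunk) * size
                records[pos] = chunk[offset : offset + size]
    return records


# Hypothetical demo data: 1,000 records of 4 bytes, each storing its own index.
size = 4
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for i in range(1000):
        tmp.write(i.to_bytes(size, "little"))
    demo_path = tmp.name

records = read_records_chunked(demo_path, [999, 0, 42, 43], size, records_per_chunk=100)
values = [int.from_bytes(record, "little") for record in records]
os.remove(demo_path)
```

In the demo, records 42 and 43 share a chunk, so three reads serve four values. With widely scattered indexes this reads far more bytes than it returns, which is why it is not a given that it helps.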

Source: https://stackoverflow.com/questions/57418396/efficiently-reading-few-lines-from-a-very-large-binary-file

Tags: python, performance, binaryfiles