Python fastest access to line in file

一向 2020-12-08 01:38

I have an ASCII table in a file from which I want to read a particular set of lines (e.g. lines 4003 to 4005). The issue is that this file could be very very long (e.g. 100

3 Answers
  •  -上瘾入骨i
    2020-12-08 01:51

    I ran into a similar problem to the one in the post above; however, the solutions already posted fell short in my particular scenario: the file was too big for linecache, and islice was nowhere near fast enough. I would like to offer a third (or fourth) alternative solution.
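    For context, the two approaches mentioned above can be sketched roughly like this (the function names and signatures are illustrative, not from the original post):

    ```python
    import linecache
    from itertools import islice

    def lines_islice(path, start, stop):
        # islice scans from the start of the file on every call:
        # simple and memory-friendly, but O(n) work per access.
        with open(path) as fp:
            return list(islice(fp, start, stop))

    def lines_linecache(path, start, stop):
        # linecache caches the whole file in memory: fast repeated
        # access, but impractical when the file is too big for RAM.
        # Note that linecache uses 1-based line numbers.
        return [linecache.getline(path, i) for i in range(start + 1, stop + 1)]
    ```

    Both return the half-open range of lines [start, stop), counted from zero; the first rereads the file on each call, while the second keeps the whole file cached.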

    My solution is based upon the fact that we can use mmap to access a particular point in the file. We need only know the byte offsets at which lines begin and end; the mmap can then hand those lines to us about as fast as linecache. To optimize this code (see the updates):

    • We use the deque class from collections to build a dynamically sized collection of line-start offsets (cheap appends).
    • We then convert it to a list, which gives O(1) random access to that collection.
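    To make the offset idea concrete, here is a minimal self-contained sketch (the file contents and names are demo data, not from the post): slicing an mmap between two recorded line-start offsets yields exactly that line, as bytes.

    ```python
    import mmap
    import os
    import tempfile

    # Write a small sample file (demo data only).
    fd, path = tempfile.mkstemp()
    os.close(fd)
    with open(path, 'wb') as f:
        f.write(b"alpha\nbeta\ngamma\n")

    # Record the byte offset at which each line starts.
    starts = [0]
    pos = 0
    with open(path, 'rb') as f:
        while True:
            c = f.read(1)
            if not c:
                break
            pos += 1
            if c == b'\n':
                starts.append(pos)

    with open(path, 'r+b') as f:
        mm = mmap.mmap(f.fileno(), 0)
        # Slicing between consecutive start offsets yields one line (bytes),
        # trailing newline included.
        line = mm[starts[1]:starts[2]]
        mm.close()

    os.remove(path)
    print(line)  # b'beta\n'
    ```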

    The following is a simple wrapper for the process:

    from collections import deque
    import mmap

    class fast_file():

        def __init__(self, file):
            self.file = file
            # Collect the byte offset at which each line starts.
            # A deque is cheap to append to; we convert it to a list
            # afterwards for O(1) random access.
            self.linepoints = deque()
            self.linepoints.append(0)
            pos = 0
            # Binary mode, so offsets match the mmap's byte offsets.
            with open(file, 'rb') as fp:
                while True:
                    c = fp.read(1)
                    if not c:
                        break
                    pos += 1
                    if c == b'\n':
                        self.linepoints.append(pos)
            self.fp = open(self.file, 'r+b')
            self.mm = mmap.mmap(self.fp.fileno(), 0)
            self.linepoints.append(pos)  # sentinel: one past the last byte
            self.linepoints = list(self.linepoints)

        def getline(self, i):
            # Line i (0-based) as bytes, trailing newline included.
            return self.mm[self.linepoints[i]:self.linepoints[i+1]]

        def close(self):
            self.fp.close()
            self.mm.close()
    

    The caveats are that the file and the mmap need closing, and that enumerating the line offsets takes some time, but that is a one-off cost. The result is something that is fast both to instantiate and in random file access; note, however, that getline returns bytes, not str.

    I tested speed by accessing a sample of my large file covering the first 1 million lines (out of 48 million). I ran the following to get an idea of the time taken to do 10 million accesses:

    import linecache
    from time import time, sleep

    linecache.getline("sample.txt", 0)  # warm up linecache's cache
    F = fast_file("sample.txt")
    
    
    sleep(1)
    start = time()
    for i in range(10000000):
        linecache.getline("sample.txt",1000)
    print(time()-start)
    
    >>> 6.914520740509033
    
    sleep(1)
    start = time()
    for i in range(10000000):
        F.getline(1000)
    print(time()-start) 
    
    >>> 4.488042593002319
    
    sleep(1)
    start = time()
    for i in range(10000000):
        F.getline(1000).decode()
    print(time()-start) 
    
    >>> 6.825756549835205
    

    It's not that much faster, and it takes longer to initiate; however, consider that my original file was too large for linecache in the first place. This simple wrapper allowed me to do random accesses to lines that linecache was unable to handle on my computer (32 GB of RAM).

    I think this may now be a faster alternative to linecache (speeds will depend on I/O and RAM), but if you have a way to improve it, please add a comment and I will update the solution accordingly.
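    One possible improvement along those lines (a sketch only; the function name and chunk size are illustrative, not part of the original answer): scanning for newlines in large chunks with bytes.find should be much faster than reading one byte at a time.

    ```python
    def scan_line_starts(path, chunk_size=1 << 20):
        """Return the byte offsets at which each line starts, scanning in chunks.

        A sketch of a faster alternative to the one-byte-at-a-time loop;
        the name and default chunk size are illustrative.
        """
        starts = [0]
        offset = 0
        with open(path, 'rb') as fp:
            while True:
                chunk = fp.read(chunk_size)
                if not chunk:
                    break
                i = chunk.find(b'\n')
                while i != -1:
                    # A line begins one byte past each newline.
                    starts.append(offset + i + 1)
                    i = chunk.find(b'\n', i + 1)
                offset += len(chunk)
        starts.append(offset)  # sentinel: one past the last byte
        return starts
    ```

    The result has the same shape as self.linepoints above, so it could be dropped into the constructor in place of the per-byte loop.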

    Update

    I recently replaced the list with a collections.deque, which is faster to append to.

    Second Update

    The collections.deque is faster for appends, but a list is faster for random access; hence, the conversion here from a deque to a list optimizes both instantiation time and random access time. I've added sleeps to this test, and the decode call in the comparison, because mmap returns bytes; decoding keeps the comparison fair.
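    That trade-off can be seen in a rough micro-benchmark sketch (timings vary by machine; no specific numbers are claimed here):

    ```python
    from collections import deque
    import timeit

    n = 100_000

    # Appending: both containers are amortized O(1) per append.
    append_deque = timeit.timeit(
        'd.append(1)', setup='from collections import deque; d = deque()',
        number=n)
    append_list = timeit.timeit('l.append(1)', setup='l = []', number=n)

    # Indexing the middle: list is O(1), deque is O(n) from the nearer end.
    d = deque(range(n))
    l = list(range(n))
    index_deque = timeit.timeit('d[50_000]', globals={'d': d}, number=n)
    index_list = timeit.timeit('l[50_000]', globals={'l': l}, number=n)

    print(append_deque, append_list)
    print(index_deque, index_list)
    ```

    Converting the deque to a list once, after the append-heavy construction phase, gets the better behavior on both sides.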
