Searching for a string in a large text file - profiling various methods in python

后端未结

关注

 6  559

傲寒 2020-12-02 05:54

This question has been asked many times. After spending some time reading the answers, I did some quick profiling to try out the various methods mentioned previously...

6条回答

醉酒成梦 (楼主)

2020-12-02 06:41
Custom hash table search with externalized strings

To get fast access time and a lower memory consumption you could do the following:
- for each line compute a string hash and add it to a hash table, e.g., index[hash] = position (do not store the string). If there is a collision, store all file positions for that key in a list.
- to look up a string, compute its hash and look it up in the table. If the key is found, read the string at position from the file to verify you really have a match. If there are multiple positions check each one until you find a match or none.
Edit 1: replaced line_number by position (as pointed out by a commenter, one obviously needs the actual position and not line numbers)

Edit 2: provide code for an implementation with a custom hash table, which shows that this approach is more memory efficient than the other approaches mentioned:
```
from collections import namedtuple 
Node = namedtuple('Node', ['pos', 'next'])

def build_table(f, size):
    table = [ None ] * size
    while True:
        pos = f.tell()
        line = f.readline()
        if not line: break
        i = hash(line) % size
        if table[i] is None:
            table[i] = pos
        else:
            table[i] = Node(pos, table[i])
    return table

def search(string, table, f):
    i = hash(string) % len(table)
    entry = table[i]
    while entry is not None:
        pos = entry.pos if isinstance(entry, Node) else entry
        f.seek(pos)
        if f.readline() == string:
            return True
        entry = entry.next if isinstance(entry, Node) else None
    return False

SIZE = 2**24
with open('data.txt', 'r') as f:
    table = build_table(f, SIZE)
    print search('Some test string\n', table, f)
```
The hash of a line is only used to index into the table (if we used a normal dict, the hashes would also be stored as keys). The file position of the line is stored at the given index. Collisions are resolved with chaining, i.e., we create a linked list. However, the first entry is never wrapped in a node (this optimization makes the code a bit more complicated but it saves quite some space).

For a file with 6 million lines I chose a hash table size of 2^24. With my test data I got 933132 collisions. (A hash table of half the size was comparable in memory consumption, but resulted in more collisions. Since more collisions means more file access for searches, I would rather use a large table.)
```
Hash table: 128MB (sys.getsizeof([None]*(2**24)))
Nodes:       64MB (sys.getsizeof(Node(None, None)) * 933132)
Pos ints:   138MB (6000000 * 24)
-----------------
TOTAL:      330MB (real memory usage of python process was ~350MB)
```
0 讨论(0)

查看其它6个回答
发布评论:

提交评论
- 加载中...