This question has been asked many times. After spending some time reading the answers, I did some quick profiling to try out the various methods mentioned previously...
Custom hash table search with externalized strings
To get fast access times and lower memory consumption, you could do the following:
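- For each line, compute a hash of the string and add it to a hash table, something like index[hash] = position (do not store the string). If there is a collision, store all file positions for that key in a list.
- To look up a string, compute its hash and look it up in the table. If the key is found, read the string at position from the file to verify you really have a match. If there are multiple positions, check each one until you find a match or none.

Edit 1: replaced line_number by position (as pointed out by a commenter, one obviously needs the actual position and not line numbers)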
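As a rough illustration of the two steps above, here is a minimal sketch that uses a plain dict of lists. The names build_index and contains are made up for this example, and the file is opened in binary mode on the assumption that byte offsets are the simplest thing to seek back to. Note that a dict also stores the hashes as keys, so this shows the idea rather than the most compact layout.

```python
import os
import tempfile
from collections import defaultdict


def build_index(f):
    """Map hash(line) -> list of file offsets; the lines themselves are not stored."""
    index = defaultdict(list)
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        index[hash(line)].append(pos)
    return index


def contains(line, index, f):
    """Re-read every candidate offset and compare it with the searched line."""
    for pos in index.get(hash(line), ()):
        f.seek(pos)
        if f.readline() == line:
            return True
    return False


# Small self-contained demo on a temporary file (hypothetical test data).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"alpha\nbeta\ngamma\n")
    path = tmp.name

with open(path, "rb") as f:
    index = build_index(f)
    found = contains(b"beta\n", index, f)
    missing = contains(b"delta\n", index, f)
os.remove(path)

print(found, missing)
```

The file has to stay open (or be reopened) for lookups, since every candidate match is verified against the file contents.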
Edit 2: added code for an implementation with a custom hash table, which shows that this approach is more memory efficient than the other approaches mentioned:
from collections import namedtuple

# A chain node: the file position of a line plus the rest of the chain.
Node = namedtuple('Node', ['pos', 'next'])

def build_table(f, size):
    table = [ None ] * size
    while True:
        pos = f.tell()
        line = f.readline()
        if not line: break
        i = hash(line) % size
        if table[i] is None:
            # The first entry in a slot is stored as a bare position.
            table[i] = pos
        else:
            # Collision: prepend a node to the chain for this slot.
            table[i] = Node(pos, table[i])
    return table

def search(string, table, f):
    i = hash(string) % len(table)
    entry = table[i]
    while entry is not None:
        pos = entry.pos if isinstance(entry, Node) else entry
        f.seek(pos)
        # Verify against the file, since different lines can share a slot.
        if f.readline() == string:
            return True
        entry = entry.next if isinstance(entry, Node) else None
    return False

SIZE = 2**24
with open('data.txt', 'r') as f:
    table = build_table(f, SIZE)
    # print(...) with a single argument works in both Python 2 and 3.
    print(search('Some test string\n', table, f))
The hash of a line is only used to index into the table (if we used a normal dict, the hashes would also be stored as keys). The file position of the line is stored at the given index. Collisions are resolved with chaining, i.e., we create a linked list. However, the first entry is never wrapped in a node (this optimization makes the code a bit more complicated, but it saves quite a bit of space).
For a file with 6 million lines I chose a hash table size of 2^24. With my test data I got 933132 collisions. (A hash table of half the size was comparable in memory consumption, but resulted in more collisions. Since more collisions mean more file accesses during searches, I would rather use a large table.)
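To get a feel for how the table size affects the collision count before building anything, one can estimate it. The sketch below assumes that hashes are spread uniformly over the table, which real string hashes only approximate, and counts a collision whenever a line lands in a slot that is already occupied (i.e., whenever a Node would be created). The helper name expected_collisions is hypothetical.

```python
import math


def expected_collisions(n_lines, table_size):
    """Expected number of lines that hit an already occupied slot,
    assuming every hash is uniformly distributed over the table."""
    # Probability that a given slot is still empty after n_lines insertions
    # is (1 - 1/table_size) ** n_lines; expm1/log1p keep this numerically stable.
    p_occupied = -math.expm1(n_lines * math.log1p(-1.0 / table_size))
    occupied_slots = table_size * p_occupied
    # Every line that did not claim a fresh slot collided with an earlier one.
    return n_lines - occupied_slots


for exponent in (23, 24):
    estimate = expected_collisions(6000000, 2 ** exponent)
    print("2^%d slots: ~%d expected collisions" % (exponent, estimate))
```

The estimate for 2^24 slots can be compared with the 933132 collisions measured above. Differences are to be expected, because the uniformity assumption is only an approximation.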
Hash table: 128MB (sys.getsizeof([None]*(2**24)))
Nodes:       64MB (sys.getsizeof(Node(None, None)) * 933132)
Pos ints:   138MB (6000000 * 24)
-----------------
TOTAL:      330MB (real memory usage of python process was ~350MB)
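The breakdown above can be recomputed on your own interpreter with a small sketch like the following. It is only an estimate under stated assumptions: object sizes are taken from sys.getsizeof on whatever Python runs it, so the per-int and per-node sizes may differ from the 24 bytes per int used above, and the helper name estimate_bytes is made up for this example. The table is not actually allocated; the per-slot size is measured on a small list instead.

```python
import sys
from collections import namedtuple

Node = namedtuple('Node', ['pos', 'next'])


def estimate_bytes(table_size, n_lines, n_collisions):
    """Rough memory estimate for the table, the chain nodes and the position ints."""
    # Size of one list slot (a pointer), measured without allocating the full table.
    slot = (sys.getsizeof([None] * 1024) - sys.getsizeof([])) // 1024
    node = sys.getsizeof(Node(None, None))
    # Size of an int object holding a typical file offset (assumed ~100 MB in).
    pos_int = sys.getsizeof(10 ** 8)
    return {
        'table': slot * table_size,
        'nodes': node * n_collisions,
        'pos ints': pos_int * n_lines,
    }


parts = estimate_bytes(2 ** 24, 6000000, 933132)
for name in ('table', 'nodes', 'pos ints'):
    print("%-9s %5d MB" % (name, parts[name] // 2 ** 20))
print("%-9s %5d MB" % ('TOTAL', sum(parts.values()) // 2 ** 20))
```

As with the figures above, the real memory usage of the process will be somewhat higher than the sum, since allocator overhead is not included.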