Searching for a string in a large text file - profiling various methods in Python

傲寒 2020-12-02 05:54

This question has been asked many times. After spending some time reading the answers, I did some quick profiling to try out the various methods mentioned previously...

6 Answers
  •  醉酒成梦
    2020-12-02 06:41

    Custom hash table search with externalized strings

    To get fast access times and lower memory consumption, you could do the following:

    • For each line, compute a string hash and add it to a hash table, e.g., index[hash] = position (do not store the string itself). If there is a collision, store all file positions for that key in a list.
    • To look up a string, compute its hash and look it up in the table. If the key is found, read the string at that position from the file to verify you really have a match. If there are multiple positions, check each one until you find a match or run out of candidates. (A minimal sketch of this dict-based scheme follows right after this list.)
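
    A minimal sketch of the dict-based scheme just described, assuming the same data.txt file and a list of positions per key, as the first bullet suggests (the names build_index and lookup are only for illustration; the custom table further down replaces this to save memory):

    import collections

    def build_index(f):
        # Map each line's hash to the list of file positions where it occurs.
        index = collections.defaultdict(list)
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            index[hash(line)].append(pos)
        return index

    def lookup(string, index, f):
        # Candidate positions merely share the hash; verify each against the file.
        for pos in index.get(hash(string), []):
            f.seek(pos)
            if f.readline() == string:
                return True
        return False

    with open('data.txt', 'r') as f:
        index = build_index(f)
        print(lookup('Some test string\n', index, f))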

    Edit 1: replaced line_number with position (as pointed out by a commenter, one obviously needs the actual file position, not the line number)

    Edit 2: provided code for an implementation with a custom hash table, which shows that this approach is more memory-efficient than the other approaches mentioned:

    from collections import namedtuple
    Node = namedtuple('Node', ['pos', 'next'])

    def build_table(f, size):
        # Each slot is either None (empty), a bare file position (single entry),
        # or a chain of Node entries ending in a bare position.
        table = [None] * size
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            i = hash(line) % size
            if table[i] is None:
                table[i] = pos                      # first entry: store the bare position
            else:
                table[i] = Node(pos, table[i])      # collision: prepend to the chain
        return table

    def search(string, table, f):
        i = hash(string) % len(table)
        entry = table[i]
        while entry is not None:
            pos = entry.pos if isinstance(entry, Node) else entry
            f.seek(pos)
            if f.readline() == string:              # verify against the actual line on disk
                return True
            entry = entry.next if isinstance(entry, Node) else None
        return False

    SIZE = 2**24
    with open('data.txt', 'r') as f:
        table = build_table(f, SIZE)
        print(search('Some test string\n', table, f))
    

    The hash of a line is only used to index into the table (if we used a normal dict, the hashes would also be stored as keys). The file position of the line is stored at the given index. Collisions are resolved by chaining, i.e., we build a linked list. However, the first entry in a bucket is never wrapped in a Node (this optimization makes the code a bit more complicated, but it saves quite a bit of space).

    For a file with 6 million lines I chose a hash table size of 2^24. With my test data this produced 933132 collisions. (A hash table of half that size was comparable in memory consumption but resulted in more collisions. Since more collisions mean more file accesses during searches, I would rather use a larger table.)

    Hash table: 128MB (sys.getsizeof([None]*(2**24)))
    Nodes:       64MB (sys.getsizeof(Node(None, None)) * 933132)
    Pos ints:   138MB (6000000 ints * 24 bytes each)
    -----------------
    TOTAL:      330MB (real memory usage of python process was ~350MB)
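
    A rough sketch of how the collision count and the estimates above can be reproduced (this reuses Node and table from the code above; count_collisions and estimate_memory are illustrative names, and the exact byte sizes are CPython- and platform-dependent):

    import sys

    def count_collisions(table):
        # Every collision creates exactly one Node, so walking all chains
        # and counting Nodes gives the number of collisions.
        nodes = 0
        for entry in table:
            while isinstance(entry, Node):
                nodes += 1
                entry = entry.next
        return nodes

    def estimate_memory(table, num_lines, num_collisions):
        slots = sys.getsizeof(table)                       # the bucket array itself
        nodes = sys.getsizeof(Node(None, None)) * num_collisions
        ints = sys.getsizeof(0) * num_lines                # one position int per line (approximate)
        return slots + nodes + ints

    collisions = count_collisions(table)
    num_lines = 6000000                                    # lines in the test file described above
    print(collisions)
    print(estimate_memory(table, num_lines, collisions) // 2**20, 'MB')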
    
