Most efficient way for a lookup/search in a huge list (python)

深忆病人 2020-12-01 01:51

-- I just parsed a big file and I created a list containing 42.000 strings/words. I want to query [against this list] to check if a given word/string belongs to it. So my qu

3 Answers
  • 2020-12-01 02:34

    Don't create a list, create a set. A set does lookups in constant time on average, whereas `in` on a list has to scan the elements one by one.
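    A minimal sketch of the set approach. The sample words and the `contains` helper are hypothetical and only stand in for the parsed file contents:

```python
# Hypothetical sample data standing in for the parsed word list.
words = ["apple", "banana", "cherry", "banana"]

# Build the set once; duplicate entries collapse automatically.
word_set = set(words)

def contains(word):
    """Average-case O(1) membership test against the set."""
    return word in word_set

print(contains("banana"))  # True
print(contains("zebra"))   # False
```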

    If you don't want the memory overhead of a set then keep a sorted list and search through it with the bisect module.

    from bisect import bisect_left
    def bi_contains(lst, item):
        """ efficient `item in lst` for sorted lists """
        # bisect_left returns the index where item would be inserted. If item
        # is larger than every element (or lst is empty) that index is
        # len(lst), so check the bound first. Otherwise, if the item is in
        # the list it has to be at index bisect_left(lst, item).
        idx = bisect_left(lst, item)
        return idx < len(lst) and lst[idx] == item
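    A usage sketch for the sorted-list approach. The helper name `sorted_contains` and the sample words are hypothetical; the bounds check is done on the index, so an empty list is handled as well:

```python
from bisect import bisect_left

def sorted_contains(sorted_words, item):
    """Membership test on a sorted list via binary search (O(log n))."""
    idx = bisect_left(sorted_words, item)
    return idx < len(sorted_words) and sorted_words[idx] == item

# Hypothetical sample data; the list must be sorted once up front.
sorted_words = sorted(["pear", "apple", "fig", "kiwi"])

print(sorted_contains(sorted_words, "fig"))    # True
print(sorted_contains(sorted_words, "zebra"))  # False (past the last element)
print(sorted_contains([], "fig"))              # False (empty list)
```

    The trade-off is the one described above: the list needs less memory than a set, but every lookup costs O(log n) instead of O(1) on average, and the list has to stay sorted.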
    
  • 2020-12-01 02:50

    Using this program, it looks like dicts are the fastest, sets second, and a list with bi_contains third:

    from datetime import datetime
    
    def ReadWordList():
        """ Loop through each line in english.txt and add it to the list in uppercase.
    
        Returns:
        Returns array with all the words in english.txt.
    
        """
        l_words = []
        with open(r'c:\english.txt', 'r') as f_in:
            for line in f_in:
                line = line.strip().upper()
                l_words.append(line)
    
        return l_words
    
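    # Read the words, then pick ONE container to benchmark by uncommenting it.
    l_words = ReadWordList()
    l_words = {key: None for key in l_words}
    #l_words = set(l_words)
    #l_words = tuple(l_words)
    #l_words = sorted(l_words)  # needed for bi_contains
    
    t1 = datetime.now()
    
    for i in range(10000):
        w = 'ZEBRA' in l_words
        # bi_contains is the function from the answer above; it only works
        # on the sorted list, not on a dict or set.
        #w = bi_contains(l_words, 'ZEBRA')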
    
    t2 = datetime.now()
    print('After: ' + str(t2 - t1))
    
    # list = 41.025293 seconds
    # dict = 0.001488 seconds
    # set = 0.001499 seconds
    # tuple = 38.975805 seconds
    # list with bi_contains = 0.014000 seconds
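    For anyone who wants to rerun the comparison without the original file, here is a self-contained sketch using `timeit`. The synthetic word list, the repeat count, and the choice of the last word as the target are assumptions, not the setup that produced the numbers above, so absolute timings will differ:

```python
import timeit
from bisect import bisect_left

def bi_contains(lst, item):
    """Binary-search membership test for a sorted list."""
    idx = bisect_left(lst, item)
    return idx < len(lst) and lst[idx] == item

# Synthetic stand-in for the 42,000-word list (not the original english.txt).
words = ["W%05d" % i for i in range(42000)]
target = words[-1]  # last element: worst case for a linear scan

containers = {
    "list": list(words),
    "tuple": tuple(words),
    "set": set(words),
    "dict": dict.fromkeys(words),
}
sorted_words = sorted(words)

# Time 100 membership tests per container.
timings = {}
for name, container in containers.items():
    timings[name] = timeit.timeit(lambda: target in container, number=100)
timings["sorted list + bisect"] = timeit.timeit(
    lambda: bi_contains(sorted_words, target), number=100
)

# Print fastest first.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:22s} {seconds:.6f} s")
```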
    
  • A point about sets versus lists that hasn't been considered: in "parsing a big file" one would expect to need to handle duplicate words/strings. You haven't mentioned this at all.

    Obviously adding new words to a set removes duplicates on the fly, at no additional cost of CPU time or your thinking time. If you try that with a list it ends up O(N**2). If you append everything to a list and remove duplicates at the end, the smartest way of doing that is ... drum roll ... use a set, and the (small) memory advantage of a list is likely to be overwhelmed by the duplicates.
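    A small sketch of the difference, with hypothetical function names and sample words. The list version rescans everything collected so far for each new word, which is where the O(N**2) comes from:

```python
def unique_words_list(words):
    """Deduplicate with a list: each `in` check scans the list, so O(N**2) overall."""
    seen = []
    for w in words:
        if w not in seen:
            seen.append(w)
    return seen

def unique_words_set(words):
    """Deduplicate with a set: each add is O(1) on average, so O(N) overall."""
    seen = set()
    for w in words:
        seen.add(w)
    return seen

# Hypothetical sample input with duplicates.
words = ["the", "cat", "the", "hat", "cat"]
print(unique_words_list(words))         # ['the', 'cat', 'hat']
print(sorted(unique_words_set(words)))  # ['cat', 'hat', 'the']
```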
