Most efficient way for a lookup/search in a huge list (python)

前端未结

关注

 3  794

-- I just parsed a big file and I created a list containing 42.000 strings/words. I want to query [against this list] to check if a given word/string belongs to it. So my qu

相关标签:

3条回答

情深已故

2020-12-01 02:34

Don't create a list, create a set. It does lookups in constant time.

If you don't want the memory overhead of a set then keep a sorted list and search through it with the bisect module.

from bisect import bisect_left
def bi_contains(lst, item):
    """ efficient `item in lst` for sorted lists """
    # if item is larger than the last its not in the list, but the bisect would 
    # find `len(lst)` as the index to insert, so check that first. Else, if the 
    # item is in the list then it has to be at index bisect_left(lst, item)
    return (item <= lst[-1]) and (lst[bisect_left(lst, item)] == item)

0 讨论(0)

失恋的感觉

2020-12-01 02:50

Using this program it looks like dicts are the fastes, set second, list with bi_contains third:

from datetime import datetime

def ReadWordList():
    """ Loop through each line in english.txt and add it to the list in uppercase.

    Returns:
    Returns array with all the words in english.txt.

    """
    l_words = []
    with open(r'c:\english.txt', 'r') as f_in:
        for line in f_in:
            line = line.strip().upper()
            l_words.append(line)

    return l_words

# Loop through each line in english.txt and add it to the l_words list in uppercase.
l_words = ReadWordList()
l_words = {key: None for key in l_words}
#l_words = set(l_words)
#l_words = tuple(l_words)

t1 = datetime.now()

for i in range(10000):
    #w = 'ZEBRA' in l_words
    w = bi_contains(l_words, 'ZEBRA')

t2 = datetime.now()
print('After: ' + str(t2 - t1))

# list = 41.025293 seconds
# dict = 0.001488 seconds
# set = 0.001499 seconds
# tuple = 38.975805 seconds
# list with bi_contains = 0.014000 seconds

0 讨论(0)

不要未来只要你来

2020-12-01 02:54

A point about sets versus lists that hasn't been considered: in "parsing a big file" one would expect to need to handle duplicate words/strings. You haven't mentioned this at all.

Obviously adding new words to a set removes duplicates on the fly, at no additional cost of CPU time or your thinking time. If you try that with a list it ends up O(N**2). If you append everything to a list and remove duplicates at the end, the smartest way of doing that is ... drum roll ... use a set, and the (small) memory advantage of a list is likely to be overwhelmed by the duplicates.

0 讨论(0)
发布评论:

提交评论
- 加载中...