Python: count occurrences in a list using dict comprehension/generator

长情又很酷 2020-12-05 21:38

I want to write some tests to analyse the efficiency of different operations in Python, namely a comparison of dictionary comprehensions and generator expressions used to build a dict.

To test

3 answers
  •  借酒劲吻你
    2020-12-05 21:59

    You cannot do this efficiently (at least in terms of memory) with a dict comprehension, because you have to keep track of the current count in another dictionary, i.e. more memory consumption. Here's how you can do it using a dict comprehension (not recommended at all :-)); note that the counts end up in the helper dictionary dct, not in the value of the comprehension itself:

    >>> words = list('asdsadDASDFASCSAASAS')
    >>> dct = {}
    >>> {w: 1 if w not in dct and not dct.update({w: 1})
    ...     else dct[w] + 1
    ...     if not dct.update({w: dct[w] + 1}) else 1 for w in words}
    >>> dct
    {'a': 2, 'A': 5, 's': 2, 'd': 2, 'F': 1, 'C': 1, 'S': 5, 'D': 2}
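
The one-liner above relies on dict.update returning None, so `not dct.update(...)` is always true and the call is made purely for its side effect. A rough unrolled sketch of the same logic follows; the name `result` is introduced here only for illustration and stands in for the dict that the comprehension itself evaluates to. It also shows why the answer inspects dct: for repeated keys the condition bumps the count before the value expression reads it, so the comprehension's own values overshoot by one.

```python
words = list('asdsadDASDFASCSAASAS')

dct = {}      # side dictionary that ends up holding the real counts
result = {}   # illustrative stand-in for the dict the comprehension returns
for w in words:
    if w not in dct:
        dct.update({w: 1})            # update() returns None, so `not None` is True
        result[w] = 1
    else:
        dct.update({w: dct[w] + 1})   # the condition runs first and bumps the count
        result[w] = dct[w] + 1        # the value expression then reads the bumped count

print(dct['A'], result['A'])  # → 5 6
```

Only dct holds the true counts, which is one more reason to avoid this construction.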
    

    Another way is to sort the words list first, group equal words with itertools.groupby, and then count the length of each group. The dict comprehension here can be converted to a generator if you want, but this still requires reading all the words into memory first:

    from itertools import groupby
    words.sort()
    dct = {k: sum(1 for _ in g) for k, g in groupby(words)}
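
As a sketch of the generator variant mentioned above: the same groupby pipeline can be written as a generator expression of (word, count) pairs and handed to dict(). The names `pairs` and `counts` are illustrative, and sorted() is used here instead of words.sort() so the input list is left untouched (an assumption about what is wanted, not part of the original answer).

```python
from itertools import groupby

words = list('asdsadDASDFASCSAASAS')

# sorted() copies the list, so every word is in memory before grouping starts
pairs = ((k, sum(1 for _ in g)) for k, g in groupby(sorted(words)))

# Nothing has been counted yet; dict() pulls (word, count) pairs one at a time
counts = dict(pairs)
print(counts['A'], counts['F'])  # → 5 1
```

The pairs are produced lazily, but the sort still needs the whole input up front, so this does not save memory over the comprehension.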
    

    Note that the fastest of the lot in the timings below is collections.defaultdict:

    from collections import defaultdict

    d = defaultdict(int)  # missing keys start at int() == 0
    for w in words:
        d[w] += 1
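
The timings in this answer also use collections.Counter and a plain dict with dict.get, neither of which is shown on its own. A minimal side-by-side sketch (the variable names are illustrative) confirming that all three loops produce the same counts:

```python
from collections import Counter, defaultdict

words = list('asdsadDASDFASCSAASAS')

# defaultdict: a missing key is created with int() == 0 on first access
by_defaultdict = defaultdict(int)
for w in words:
    by_defaultdict[w] += 1

# plain dict: get() supplies 0 for keys not seen yet
by_get = {}
for w in words:
    by_get[w] = by_get.get(w, 0) + 1

# Counter: does the counting in a single call
by_counter = Counter(words)

print(dict(by_defaultdict) == by_get == dict(by_counter))  # → True
```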
    

    Timing comparisons:

    >>> from collections import Counter, defaultdict
    >>> from itertools import groupby
    >>> from string import ascii_letters, digits
    >>> %timeit words = list(ascii_letters+digits)*10**4; words.sort(); {k: sum(1 for _ in g) for k, g in groupby(words)}
    10 loops, best of 3: 131 ms per loop
    >>> %timeit words = list(ascii_letters+digits)*10**4; Counter(words)
    10 loops, best of 3: 169 ms per loop
    >>> %timeit words = list(ascii_letters+digits)*10**4; dct = {}; {w: 1 if w not in dct and not dct.update({w: 1}) else dct[w] + 1 if not dct.update({w: dct[w] + 1}) else 1 for w in words}
    1 loops, best of 3: 315 ms per loop
    >>> %%timeit
    ... words = list(ascii_letters+digits)*10**4
    ... d = defaultdict(int)
    ... for w in words: d[w] += 1
    ... 
    10 loops, best of 3: 57.1 ms per loop
    >>> %%timeit
    ... words = list(ascii_letters+digits)*10**4
    ... d = {}
    ... for w in words: d[w] = d.get(w, 0) + 1
    ... 
    10 loops, best of 3: 108 ms per loop
    
    # Increase input size
    
    >>> %timeit words = list(ascii_letters+digits)*10**5; words.sort(); {k: sum(1 for _ in g) for k, g in groupby(words)}
    1 loops, best of 3: 1.44 s per loop
    >>> %timeit words = list(ascii_letters+digits)*10**5; Counter(words)
    1 loops, best of 3: 1.7 s per loop
    >>> %timeit words = list(ascii_letters+digits)*10**5; dct = {}; {w: 1 if w not in dct and not dct.update({w: 1}) else dct[w] + 1 if not dct.update({w: dct[w] + 1}) else 1 for w in words}
    1 loops, best of 3: 3.19 s per loop
    >>> %%timeit
    ... words = list(ascii_letters+digits)*10**5
    ... d = defaultdict(int)
    ... for w in words: d[w] += 1
    ... 
    1 loops, best of 3: 571 ms per loop
    >>> %%timeit
    ... words = list(ascii_letters+digits)*10**5
    ... d = {}
    ... for w in words: d[w] = d.get(w, 0) + 1
    ... 
    1 loops, best of 3: 1.1 s per loop
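
The timings above use IPython's %timeit magics. Outside IPython, a comparable measurement can be sketched with the standard timeit module. All function names below (`make_words`, `with_groupby`, `compare`, and so on) are hypothetical helpers, not part of the original answer. Two deliberate differences from the session above: the input list is built once outside the timed call, and sorted() is used instead of an in-place sort so every approach sees the same unsorted input. Absolute numbers will vary with the machine and the Python version.

```python
import timeit
from collections import Counter, defaultdict
from itertools import groupby
from string import ascii_letters, digits


def make_words(repeat):
    # Same input shape as the IPython timings: 62 distinct characters, repeated
    return list(ascii_letters + digits) * repeat


def with_groupby(words):
    return {k: sum(1 for _ in g) for k, g in groupby(sorted(words))}


def with_counter(words):
    return Counter(words)


def with_defaultdict(words):
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    return d


def with_get(words):
    d = {}
    for w in words:
        d[w] = d.get(w, 0) + 1
    return d


def compare(repeat=10**3, number=5):
    """Return best-of-3 seconds per call for each approach (hypothetical helper)."""
    words = make_words(repeat)
    results = {}
    for fn in (with_groupby, with_counter, with_defaultdict, with_get):
        # timeit.repeat returns one total per round; keep the fastest round
        best = min(timeit.repeat(lambda: fn(words), number=number, repeat=3))
        results[fn.__name__] = best / number
    return results


if __name__ == "__main__":
    for name, seconds in compare().items():
        print(f"{name}: {seconds * 1e3:.2f} ms per call")
```

Raising `repeat` to 10**4 or 10**5 approximates the input sizes used in the session above.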
    
