Python: count occurrences in a list using dict comprehension/generator

长情又很酷 2020-12-05 21:38

I want to write some tests to analyse the efficiency of different operations in Python, namely a comparison of dictionary comprehensions with dicts built from generator expressions.

To test

3 Answers
  • 2020-12-05 21:59

    You cannot do this efficiently (at least in terms of memory) with a dict comprehension, because you have to track the running count in a second dictionary, which costs extra memory. Here is how it can be done with a dict comprehension anyway (not recommended at all :-)); note that the counts are read from the helper dictionary dct, not from the comprehension's own result:

    >>> words = list('asdsadDASDFASCSAASAS')
    >>> dct = {}
    >>> {w: 1 if w not in dct and not dct.update({w: 1})
                      else dct[w] + 1
                      if not dct.update({w: dct[w] + 1}) else 1 for w in words}
    >>> dct
    {'a': 2, 'A': 5, 's': 2, 'd': 2, 'F': 1, 'C': 1, 'S': 5, 'D': 2}
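
Why the counts are read from dct rather than from the comprehension itself can be checked with a small sketch. The three-letter input and the name result are assumptions added for illustration; the comprehension is the one from the answer, unchanged.

```python
# Hypothetical short input, chosen so the result is easy to verify by hand.
words = list('aab')

dct = {}
# Same comprehension as above; dct is updated as a side effect of the
# conditions, and the value expression is evaluated after that update.
result = {w: 1 if w not in dct and not dct.update({w: 1})
          else dct[w] + 1
          if not dct.update({w: dct[w] + 1}) else 1 for w in words}

print(dct)     # counts accumulated in the helper dictionary
print(result)  # the comprehension's own return value
```

With this input, dct holds 2 for 'a', while result holds 3: for a repeated key the value expression runs after dct has already been incremented, so the returned dictionary is one too high. Only dct should be trusted.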
    

    Another way is to sort the words list first, group equal words with itertools.groupby, and count the length of each group. The dict comprehension here can be converted to a generator if you want, but it still requires reading all the words into memory first:

    from itertools import groupby
    words.sort()
    dct = {k: sum(1 for _ in g) for k, g in groupby(words)}
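
The generator variant mentioned above might look like the following sketch. The function name iter_counts is hypothetical, and sorted() is used instead of an in-place sort so the caller's list is left untouched.

```python
from itertools import groupby

def iter_counts(words):
    """Yield (word, count) pairs lazily, one group at a time."""
    # sorted() still materialises the whole input in memory,
    # as the answer points out.
    for key, group in groupby(sorted(words)):
        yield key, sum(1 for _ in group)

words = list('asdsadDASDFASCSAASAS')
counts = dict(iter_counts(words))
print(counts)
```

The pairs come out in sorted key order, so a consumer can stop early without the full result dictionary ever being built.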
    

    Note that the fastest one of the lot is collections.defaultdict:

    from collections import defaultdict
    d = defaultdict(int)
    for w in words:
        d[w] += 1
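
For reference, a minimal sketch that runs the defaultdict loop next to collections.Counter (which also appears in the timings) and checks that both give the same counts. The sample input is reused from the first example.

```python
from collections import Counter, defaultdict

words = list('asdsadDASDFASCSAASAS')

# defaultdict(int): a missing key starts at int() == 0.
d = defaultdict(int)
for w in words:
    d[w] += 1

# Counter does the same counting in a single call.
c = Counter(words)

# Both are dict subclasses, so they compare equal to plain dicts.
assert dict(d) == dict(c)
print(dict(d))
```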
    

    Timing comparisons:

    >>> from collections import Counter, defaultdict
    >>> from itertools import groupby
    >>> from string import ascii_letters, digits
    >>> %timeit words = list(ascii_letters+digits)*10**4; words.sort(); {k: sum(1 for _ in g) for k, g in groupby(words)}
    10 loops, best of 3: 131 ms per loop
    >>> %timeit words = list(ascii_letters+digits)*10**4; Counter(words)
    10 loops, best of 3: 169 ms per loop
    >>> %timeit words = list(ascii_letters+digits)*10**4; dct = {}; {w: 1 if w not in dct and not dct.update({w: 1}) else dct[w] + 1 if not dct.update({w: dct[w] + 1}) else 1 for w in words}
    1 loops, best of 3: 315 ms per loop
    >>> %%timeit
    ... words = list(ascii_letters+digits)*10**4
    ... d = defaultdict(int)
    ... for w in words: d[w] += 1
    ... 
    10 loops, best of 3: 57.1 ms per loop
    >>> %%timeit
    ... words = list(ascii_letters+digits)*10**4
    ... d = {}
    ... for w in words: d[w] = d.get(w, 0) + 1
    ... 
    10 loops, best of 3: 108 ms per loop
    
    # Increase input size
    
    >>> %timeit words = list(ascii_letters+digits)*10**5; words.sort(); {k: sum(1 for _ in g) for k, g in groupby(words)}
    1 loops, best of 3: 1.44 s per loop
    >>> %timeit words = list(ascii_letters+digits)*10**5; Counter(words)
    1 loops, best of 3: 1.7 s per loop
    >>> %timeit words = list(ascii_letters+digits)*10**5; dct = {}; {w: 1 if w not in dct and not dct.update({w: 1}) else dct[w] + 1 if not dct.update({w: dct[w] + 1}) else 1 for w in words}
    1 loops, best of 3: 3.19 s per loop
    >>> %%timeit
    ... words = list(ascii_letters+digits)*10**5
    ... d = defaultdict(int)
    ... for w in words: d[w] += 1
    ... 
    1 loops, best of 3: 571 ms per loop
    >>> %%timeit
    ... words = list(ascii_letters+digits)*10**5
    ... d = {}
    ... for w in words: d[w] = d.get(w, 0) + 1
    ... 
    1 loops, best of 3: 1.1 s per loop
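
The measurements above rely on IPython's %timeit. Outside IPython, a comparable test could be sketched with the standard timeit module. The function names, CANDIDATES and benchmark are hypothetical helpers; unlike the transcript, building the input list is kept outside the timed code, so absolute numbers will differ.

```python
import timeit
from collections import Counter, defaultdict
from itertools import groupby
from string import ascii_letters, digits

def with_groupby(words):
    return {k: sum(1 for _ in g) for k, g in groupby(sorted(words))}

def with_counter(words):
    return Counter(words)

def with_defaultdict(words):
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    return d

def with_dict_get(words):
    d = {}
    for w in words:
        d[w] = d.get(w, 0) + 1
    return d

CANDIDATES = [with_groupby, with_counter, with_defaultdict, with_dict_get]

def benchmark(words, repeat=3, number=1):
    """Return {function name: best wall-clock time in seconds}."""
    results = {}
    for func in CANDIDATES:
        timer = timeit.Timer(lambda: func(words))
        # Take the minimum over the repeats, as %timeit's "best of 3" does.
        results[func.__name__] = min(timer.repeat(repeat=repeat, number=number))
    return results

# Smaller input than in the transcript so the sketch finishes quickly.
words = list(ascii_letters + digits) * 10**3
for name, seconds in sorted(benchmark(words).items(), key=lambda kv: kv[1]):
    print(f'{name:18s} {seconds * 1000:8.2f} ms')
```

Timings depend heavily on the Python version and the machine, so the ranking printed here need not match the figures quoted above.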
    
  • 2020-12-05 22:02

    This is a use case for which a comprehension is neither well suited nor efficient.

    A comprehension works well when you can build the collection in a single operation. That is not really the case here, since:

    • either you take the words as they come and update the values in the dict accordingly,
    • or you first compute the key set (Rawing's solution), but then you traverse the list once to get the key set and once more per key.

    IMHO, the most efficient way is the iterative one.
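
The two options listed above can be put side by side in a short sketch. The function names are hypothetical and the word list is borrowed from another answer in this thread.

```python
def count_iteratively(words):
    """Single pass: update the count as each word arrives."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

def count_per_key(words):
    """One pass to build the key set, then one list.count() pass per key."""
    return {key: words.count(key) for key in set(words)}

words = ['this', 'that', 'is', 'if', 'that', 'is', 'if', 'this', 'that']

# Same result either way; only the number of list traversals differs.
assert count_iteratively(words) == count_per_key(words)
print(count_iteratively(words))
```

With n words and k distinct keys, the first function traverses the list once, while the second traverses it k + 1 times, which is why the iterative form scales better as the number of distinct words grows.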

  • 2020-12-05 22:09

    You can do it this way:

    >>> words=['this','that','is','if','that','is','if','this','that']
    >>> {i:words.count(i) for i in words}
    {'this': 2, 'is': 2, 'if': 2, 'that': 3}
    