Python: count occurrences in a list using dict comprehension/generator

I want to write some tests to analyse the efficiency of different operations in Python, namely a comparison of dictionary comprehensions and dict() called on a generator expression.

To test

3 Answers

借酒劲吻你

2020-12-05 21:59

You cannot do this efficiently (at least in terms of memory) using a dict comprehension, because you would have to keep track of the current count in another dictionary, i.e. more memory consumption. Here's how you can do it using a dict comprehension (not recommended at all :-)):

>>> words = list('asdsadDASDFASCSAASAS')
>>> dct = {}
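>>> # The counts accumulate in dct as a side effect; the comprehension's own value is discarded.
>>> _ = {w: 1 if w not in dct and not dct.update({w: 1})
...         else dct[w] + 1
...         if not dct.update({w: dct[w] + 1}) else 1 for w in words}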
>>> dct
{'a': 2, 'A': 5, 's': 2, 'd': 2, 'F': 1, 'C': 1, 'S': 5, 'D': 2}
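To see why the answer reads the counts from dct rather than from the comprehension itself, here is a small check (my own illustration, not part of the original answer; the name `result` is introduced only for this sketch). The update() call in the second condition runs before dct[w] + 1 is evaluated, so the comprehension's own value ends up one too high for every repeated key, while dct stays correct:

```python
from collections import Counter

words = list('asdsadDASDFASCSAASAS')

dct = {}
# Same expression as above, but this time we keep the comprehension's value.
result = {w: 1 if w not in dct and not dct.update({w: 1})
          else dct[w] + 1
          if not dct.update({w: dct[w] + 1}) else 1
          for w in words}

# dct was filled in as a side effect and holds the real counts.
print(dct == dict(Counter(words)))   # True

# The comprehension's own value over-counts keys that occur more than once.
print(result['a'], dct['a'])         # 3 2
print(result['F'], dct['F'])         # 1 1
```

In other words, the comprehension is only being used as a loop here, which is one more reason to prefer a plain for loop.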

Another way is to sort the words list first, group the words using itertools.groupby, and then count the length of each group. The dict comprehension here can be converted to a generator if you want, but this still requires reading all the words into memory first:

from itertools import groupby
words.sort()
dct = {k: sum(1 for _ in g) for k, g in groupby(words)}
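The sort matters because groupby only merges adjacent equal items. A minimal sketch of that point, assuming you would rather not mutate the input (the helper name `count_with_groupby` is hypothetical and uses sorted() instead of words.sort()):

```python
from itertools import groupby


def count_with_groupby(words):
    """Count items by grouping a sorted copy of the input."""
    # groupby starts a new group every time the key changes, so equal items
    # must be adjacent; sorted() guarantees that without touching `words`.
    return {key: sum(1 for _ in group) for key, group in groupby(sorted(words))}


words = list('asdsadDASDFASCSAASAS')
counts = count_with_groupby(words)
print(counts['A'], counts['a'], counts['F'])   # 5 2 1

# Without sorting, each key keeps only the length of its *last* run.
unsorted_attempt = {k: sum(1 for _ in g) for k, g in groupby(words)}
print(unsorted_attempt['A'])                   # 1
```

The sorted copy costs extra memory, which is the same trade-off the answer mentions about having to read all the words first.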

Note that the fastest one of the lot is collections.defaultdict:

from collections import defaultdict

d = defaultdict(int)  # missing keys start at 0
for w in words:
    d[w] += 1

Timing comparisons:

>>> from collections import Counter, defaultdict
>>> from itertools import groupby
>>> from string import ascii_letters, digits
>>> %timeit words = list(ascii_letters+digits)*10**4; words.sort(); {k: sum(1 for _ in g) for k, g in groupby(words)}
10 loops, best of 3: 131 ms per loop
>>> %timeit words = list(ascii_letters+digits)*10**4; Counter(words)
10 loops, best of 3: 169 ms per loop
>>> %timeit words = list(ascii_letters+digits)*10**4; dct = {}; {w: 1 if w not in dct and not dct.update({w: 1}) else dct[w] + 1 if not dct.update({w: dct[w] + 1}) else 1 for w in words}
1 loops, best of 3: 315 ms per loop
>>> %%timeit
... words = list(ascii_letters+digits)*10**4
... d = defaultdict(int)
... for w in words: d[w] += 1
... 
10 loops, best of 3: 57.1 ms per loop
>>> %%timeit
... words = list(ascii_letters+digits)*10**4
... d = {}
... for w in words: d[w] = d.get(w, 0) + 1
... 
10 loops, best of 3: 108 ms per loop

# Increase the input size

>>> %timeit words = list(ascii_letters+digits)*10**5; words.sort(); {k: sum(1 for _ in g) for k, g in groupby(words)}
1 loops, best of 3: 1.44 s per loop
>>> %timeit words = list(ascii_letters+digits)*10**5; Counter(words)
1 loops, best of 3: 1.7 s per loop
>>> %timeit words = list(ascii_letters+digits)*10**5; dct = {}; {w: 1 if w not in dct and not dct.update({w: 1}) else dct[w] + 1 if not dct.update({w: dct[w] + 1}) else 1 for w in words}
1 loops, best of 3: 3.19 s per loop
>>> %%timeit
... words = list(ascii_letters+digits)*10**5
... d = defaultdict(int)
... for w in words: d[w] += 1
... 
1 loops, best of 3: 571 ms per loop
>>> %%timeit
... words = list(ascii_letters+digits)*10**5
... d = {}
... for w in words: d[w] = d.get(w, 0) + 1
... 
1 loops, best of 3: 1.1 s per loop
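If you want to rerun the comparison outside IPython, here is a rough sketch using the standard timeit module. The function names and the `run_benchmark` helper are my own, and unlike the %timeit lines above the word list is built once outside the timed call, so the absolute numbers will not match. The ranking can also differ between Python versions and machines, so treat the figures above as one data point rather than a rule.

```python
from collections import Counter, defaultdict
from itertools import groupby
from string import ascii_letters, digits
from timeit import timeit


def count_groupby(words):
    # groupby only merges adjacent equal items, so work on a sorted copy.
    return {k: sum(1 for _ in g) for k, g in groupby(sorted(words))}


def count_counter(words):
    return dict(Counter(words))


def count_defaultdict(words):
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    return dict(d)


def count_dict_get(words):
    d = {}
    for w in words:
        d[w] = d.get(w, 0) + 1
    return d


def run_benchmark(repeat_factor=10**3, number=5):
    """Return the average seconds per call for each counting function."""
    words = list(ascii_letters + digits) * repeat_factor
    results = {}
    for fn in (count_groupby, count_counter, count_defaultdict, count_dict_get):
        results[fn.__name__] = timeit(lambda: fn(words), number=number) / number
    return results


if __name__ == "__main__":
    # Print the fastest function first.
    for name, seconds in sorted(run_benchmark().items(), key=lambda kv: kv[1]):
        print(f"{name:20s} {seconds * 1e3:8.2f} ms per call")
```

Raise repeat_factor to 10**4 or 10**5 to approximate the input sizes used in the answer.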

轮回少年

2020-12-05 22:02
This is a use case where a comprehension is not well suited or efficient.

A comprehension works well when you can build the collection in a single operation. That is not really the case here, since:
- either you take the words as they come and update the values in the dict accordingly,
- or you first compute the key set (Rawing's solution), but then you traverse the list once to get the key set and once more per key.
IMHO, the most efficient way is the iterative one.
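A minimal sketch of the two options described above, with hypothetical function names, in case it helps to see the difference in the number of passes over the list:

```python
def count_incremental(words):
    """Option 1: take the words as they come, one pass over the list."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts


def count_via_key_set(words):
    """Option 2: build the key set first, then count each key separately."""
    keys = set(words)                          # one pass to collect the keys
    return {k: words.count(k) for k in keys}   # one more full pass per key


words = ['this', 'that', 'is', 'if', 'that', 'is', 'if', 'this', 'that']
print(count_incremental(words) == count_via_key_set(words))   # True
print(count_incremental(words)['that'])                       # 3
```

Both give the same result; the second one just does roughly (number of distinct keys + 1) passes over the list instead of one.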

-上瘾入骨i

2020-12-05 22:09

You can do it this way:

>>> words=['this','that','is','if','that','is','if','this','that']
>>> {i:words.count(i) for i in words}
{'this': 2, 'is': 2, 'if': 2, 'that': 3}