I want to write some tests to analyse the efficiency of different operations in Python, namely a comparison of dictionary comprehensions and dict generators.
To test
You cannot do this efficiently (at least in terms of memory) using a dict comprehension, because you then have to keep track of the current count in another dictionary, i.e. more memory consumption. Here's how you can do it using a dict comprehension (not recommended at all :-)); note that the counts are read from the helper dictionary dct, not from the comprehension's own result:
>>> words = list('asdsadDASDFASCSAASAS')
>>> dct = {}
>>> {w: 1 if w not in dct and not dct.update({w: 1})
...     else dct[w] + 1
...     if not dct.update({w: dct[w] + 1}) else 1 for w in words}
>>> dct
{'a': 2, 'A': 5, 's': 2, 'd': 2, 'F': 1, 'C': 1, 'S': 5, 'D': 2}
Another way is to sort the words list first, then group the words using itertools.groupby and count the length of each group. Here the dict comprehension can be converted to a generator if you want, but yes, this still requires reading all the words into memory first:
from itertools import groupby
words.sort()
dct = {k: sum(1 for _ in g) for k, g in groupby(words)}
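To illustrate the remark about converting the comprehension to a generator, here is a minimal sketch. The helper name iter_counts is made up for this example; it yields (word, count) pairs one at a time, although sorted() still has to hold every word in memory before the first pair comes out:

```python
from itertools import groupby

def iter_counts(words):
    """Yield (word, count) pairs lazily (hypothetical helper name)."""
    # sorted() builds a full list, so the input is still read into memory first
    for key, group in groupby(sorted(words)):
        # count the group's members without materialising them as a list
        yield key, sum(1 for _ in group)

words = list('asdsadDASDFASCSAASAS')
counts = dict(iter_counts(words))
print(counts)
```

Using sorted() instead of words.sort() leaves the caller's list untouched, which is a design choice of this sketch rather than something the answer above requires.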
Note that the fastest one of the lot is collections.defaultdict:
from collections import defaultdict
d = defaultdict(int)  # missing keys start at int() == 0
for w in words: d[w] += 1
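As a quick sanity check, the defaultdict loop and collections.Counter (which appears in the timings) should produce the same mapping. This is only a sketch on the small sample list from earlier; which of the two is faster may well depend on the Python version, so it is worth measuring on your own setup:

```python
from collections import Counter, defaultdict

words = list('asdsadDASDFASCSAASAS')

# Manual counting: each missing key is created with the value 0
d = defaultdict(int)
for w in words:
    d[w] += 1

# Counter does the same bookkeeping in a single call
c = Counter(words)

# Compare as plain dicts so only the key/value pairs matter
same = dict(d) == dict(c)
print(same)
```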
Timing comparisons:
>>> from collections import Counter, defaultdict
>>> from itertools import groupby
>>> from string import ascii_letters, digits
>>> %timeit words = list(ascii_letters+digits)*10**4; words.sort(); {k: sum(1 for _ in g) for k, g in groupby(words)}
10 loops, best of 3: 131 ms per loop
>>> %timeit words = list(ascii_letters+digits)*10**4; Counter(words)
10 loops, best of 3: 169 ms per loop
>>> %timeit words = list(ascii_letters+digits)*10**4; dct = {}; {w: 1 if w not in dct and not dct.update({w: 1}) else dct[w] + 1 if not dct.update({w: dct[w] + 1}) else 1 for w in words}
1 loops, best of 3: 315 ms per loop
>>> %%timeit
... words = list(ascii_letters+digits)*10**4
... d = defaultdict(int)
... for w in words: d[w] += 1
...
10 loops, best of 3: 57.1 ms per loop
>>> %%timeit
... words = list(ascii_letters+digits)*10**4
... d = {}
... for w in words: d[w] = d.get(w, 0) + 1
...
10 loops, best of 3: 108 ms per loop
# Increase input size
>>> %timeit words = list(ascii_letters+digits)*10**5; words.sort(); {k: sum(1 for _ in g) for k, g in groupby(words)}
1 loops, best of 3: 1.44 s per loop
>>> %timeit words = list(ascii_letters+digits)*10**5; Counter(words)
1 loops, best of 3: 1.7 s per loop
>>> %timeit words = list(ascii_letters+digits)*10**5; dct = {}; {w: 1 if w not in dct and not dct.update({w: 1}) else dct[w] + 1 if not dct.update({w: dct[w] + 1}) else 1 for w in words}
1 loops, best of 3: 3.19 s per loop
>>> %%timeit
... words = list(ascii_letters+digits)*10**5
... d = defaultdict(int)
... for w in words: d[w] += 1
...
1 loops, best of 3: 571 ms per loop
>>> %%timeit
... words = list(ascii_letters+digits)*10**5
... d = {}
... for w in words: d[w] = d.get(w, 0) + 1
...
1 loops, best of 3: 1.1 s per loop
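The %timeit lines above only work inside IPython. If you want the same kind of comparison in a plain script, something like the following sketch with the standard timeit module should do. The function names are invented for this example, the input is smaller so it finishes quickly, and unlike the lines above the list is built once outside the timed call:

```python
import timeit
from collections import Counter, defaultdict
from string import ascii_letters, digits

def count_defaultdict(words):
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    return d

def count_get(words):
    d = {}
    for w in words:
        d[w] = d.get(w, 0) + 1
    return d

def count_counter(words):
    return Counter(words)

# Smaller input than in the timings above, built once outside the timed code
words = list(ascii_letters + digits) * 10**3

LOOPS = 5
results = {}
for func in (count_defaultdict, count_get, count_counter):
    # repeat() returns one total per run; the minimum is the least noisy figure
    best = min(timeit.repeat(lambda: func(words), number=LOOPS, repeat=3))
    results[func.__name__] = best / LOOPS
    print(f"{func.__name__}: {best / LOOPS * 1e3:.2f} ms per loop")
```

The absolute numbers will not match the ones quoted above, since they depend on the machine and the Python version; only the relative ordering on your own setup is meaningful.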
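This is a use case where a comprehension is not well suited or efficient.
A comprehension is good when you can build the collection in one single operation. That is not really the case here, since each word's count depends on how many times the word has already been seen, so the result has to be accumulated as you go.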
IMHO, the most efficient way is the iterative one.
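A minimal sketch of what the iterative way can look like when the words arrive as a stream. The names count_words and read_words are made up for this example, and the two hard-coded lines stand in for whatever source you actually read from:

```python
def count_words(stream):
    """Count words from any iterable in a single pass."""
    counts = {}
    for w in stream:
        # get() supplies 0 the first time a word is seen
        counts[w] = counts.get(w, 0) + 1
    return counts

def read_words(lines):
    """Yield words one at a time so the full word list is never built."""
    for line in lines:
        yield from line.split()

# Stand-in input; in practice this could be an open file object
lines = ["this that is if", "that is if this that"]
counts = count_words(read_words(lines))
print(counts)
```

Only the counts dictionary grows here, so memory use follows the number of distinct words rather than the total number of words.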
You can do it this way:
>>> words=['this','that','is','if','that','is','if','this','that']
>>> {i:words.count(i) for i in words}
{'this': 2, 'is': 2, 'if': 2, 'that': 3}
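One thing to keep in mind with this approach is that words.count(i) scans the whole list every time it is called, and the comprehension calls it once per element, repeats included. A small variation, sketched below, iterates over set(words) so each distinct word is counted only once. It is still one full scan per distinct word, so a single-pass loop should do less work on large inputs:

```python
words = ['this', 'that', 'is', 'if', 'that', 'is', 'if', 'this', 'that']

# set(words) removes repeats, so count() runs once per distinct word
counts = {w: words.count(w) for w in set(words)}
print(counts)
```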