I've been playing around with different ways (in Python 2.7) to extract a list of (word, frequency) tuples from a corpus, or list of strings, and comparing their performance.
Given a sorted list of words as you mention, have you tried the traditional Pythonic approach of itertools.groupby?
from itertools import groupby
some_data = ['a', 'a', 'b', 'c', 'c', 'c']
count = dict((k, sum(1 for i in v)) for k, v in groupby(some_data))  # or
count = {k: sum(1 for i in v) for k, v in groupby(some_data)}
# {'a': 2, 'c': 3, 'b': 1}
To answer the question from the title: Counter, dict, defaultdict, and OrderedDict are hash-based types: to look up an item they compute a hash of the key and use it to find the item. They even support keys that have no defined order, as long as the keys are hashable; in other words, Counter can't take advantage of pre-sorted input.
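A quick illustration of the difference (a minimal sketch): Counter accepts the words in any order, while groupby() only merges adjacent equal items, so it gives per-run counts unless the input is sorted (or otherwise grouped):

from collections import Counter
from itertools import groupby

words = ['b', 'a', 'b', 'a', 'a']  # not sorted
print(Counter(words))              # Counter({'a': 3, 'b': 2})
print([(k, len(list(g))) for k, g in groupby(words)])
# [('b', 1), ('a', 1), ('b', 1), ('a', 2)]  -- adjacent runs only
print([(k, len(list(g))) for k, g in groupby(sorted(words))])
# [('a', 3), ('b', 2)]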
The measurements show that sorting the input words takes longer than counting the words with a dictionary-based approach and sorting the result, combined:
sorted 3.19
count_words_Counter 2.88
count_words_defaultdict 2.45
count_words_dict 2.58
count_words_groupby 3.44
count_words_groupby_sum 3.52
Also, counting the words in already-sorted input with groupby() takes only a fraction of the time it takes to sort the input in the first place, and is faster than the dict-based approaches.
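For reference, a variant that relies on the input being pre-sorted could look like this (a sketch; count_words_groupby_presorted is a hypothetical name, not one of the benchmarked functions below). It differs from count_words_groupby() only in that it skips the sorted() call:

from itertools import groupby

def count_words_groupby_presorted(sorted_words):
    # assumes `sorted_words` is already sorted; groupby() merges each run of equal words
    return [(w, sum(1 for _ in gr)) for w, gr in groupby(sorted_words)]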
from collections import Counter, defaultdict
from itertools import groupby
import nltk

def count_words_Counter(words):
    return sorted(Counter(words).items())

def count_words_groupby(words):
    return [(w, len(list(gr))) for w, gr in groupby(sorted(words))]

def count_words_groupby_sum(words):
    return [(w, sum(1 for _ in gr)) for w, gr in groupby(sorted(words))]

def count_words_defaultdict(words):
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    return sorted(d.items())

def count_words_dict(words):
    d = {}
    for w in words:
        try:
            d[w] += 1
        except KeyError:
            d[w] = 1
    return sorted(d.items())

def _count_words_freqdist(words):
    # note: .items() returns words sorted by word frequency (decreasing order)
    # (same as `Counter.most_common()`)
    # so the code sorts twice (the second time in lexicographical order)
    return sorted(nltk.FreqDist(words).items())
To reproduce the results, run the benchmark code (count-words-performance.py, linked at the end).
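The actual harness is in the linked script; a rough sketch of the kind of timing involved (assuming the count_words_*() functions above, nltk's Gutenberg corpus, and timeit) might look like this:

import timeit
import nltk

WORDS = list(nltk.corpus.gutenberg.words())  # see the note below

def time_f(func, repeat=3):
    # best of `repeat` runs, one call per run
    return min(timeit.repeat(lambda: func(WORDS), repeat=repeat, number=1))

for f in [sorted, count_words_Counter, count_words_defaultdict,
          count_words_dict, count_words_groupby, count_words_groupby_sum]:
    print("%-28s %.2f" % (f.__name__, time_f(f)))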
Note: it is 3 times faster if nltk's lazy sequence of words is converted to a list (WORDS = list(nltk.corpus.gutenberg.words())), but the relative performance is the same:
sorted 1.22
count_words_Counter 0.86
count_words_defaultdict 0.48
count_words_dict 0.54
count_words_groupby 1.49
count_words_groupby_sum 1.55
The results are similar to Python - Is a dictionary slow to find frequency of each character?.
If you want to normalize the words (remove punctuation, make them lowercase, etc.), see answers to What is the most efficient way in Python to convert a string to all lowercase stripping out all non-ascii alpha characters?. Some examples:
from string import ascii_letters, ascii_lowercase, maketrans

def toascii_letter_lower_genexpr(s, _letter_set=ascii_lowercase):
    """
    >>> toascii_letter_lower_genexpr("ABC,-.!def")
    'abcdef'
    """
    return ''.join(c for c in s.lower() if c in _letter_set)

def toascii_letter_lower_genexpr_set(s, _letter_set=set(ascii_lowercase)):
    return ''.join(c for c in s.lower() if c in _letter_set)

def toascii_letter_lower_translate(s,
        table=maketrans(ascii_letters, ascii_lowercase * 2),
        deletechars=''.join(set(maketrans('', '')) - set(ascii_letters))):
    return s.translate(table, deletechars)

def toascii_letter_lower_filter(s, _letter_set=set(ascii_letters)):
    return filter(_letter_set.__contains__, s).lower()
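All four variants produce the same result on a small input (a quick sanity check):

s = "ABC,-.!def"
print(toascii_letter_lower_genexpr(s))      # abcdef
print(toascii_letter_lower_genexpr_set(s))  # abcdef
print(toascii_letter_lower_translate(s))    # abcdef
print(toascii_letter_lower_filter(s))       # abcdef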
To count and normalize the words simultaneously:
from itertools import imap

def combine_counts(items):
    d = defaultdict(int)
    for word, count in items:
        d[word] += count
    return d.iteritems()

def clean_words_in_items(clean_word, items):
    return ((clean_word(word), count) for word, count in items)

def normalize_count_words(words):
    """Normalize then count words."""
    return count_words_defaultdict(imap(toascii_letter_lower_translate, words))

def count_normalize_words(words):
    """Count then normalize words."""
    freqs = count_words_defaultdict(words)
    freqs = clean_words_in_items(toascii_letter_lower_translate, freqs)
    return sorted(combine_counts(freqs))
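For example, using the functions above (a quick check), both orderings give the same counts on a small input:

words = ['The', 'the', 'the,', 'cat.', 'cat']
print(count_normalize_words(words))   # [('cat', 2), ('the', 3)]
print(normalize_count_words(words))   # [('cat', 2), ('the', 3)]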
Results
I've updated the benchmark to measure various combinations of count_words*() and toascii*() functions (5x4 pairs not shown):
toascii_letter_lower_filter 0.954 usec small
toascii_letter_lower_genexpr 2.44 usec small
toascii_letter_lower_genexpr_set 2.19 usec small
toascii_letter_lower_translate 0.633 usec small
toascii_letter_lower_filter 124 usec random 2000
toascii_letter_lower_genexpr 197 usec random 2000
toascii_letter_lower_genexpr_set 121 usec random 2000
toascii_letter_lower_translate 7.73 usec random 2000
sorted 1.28 sec
count_words_Counter 941 msec
count_words_defaultdict 501 msec
count_words_dict 571 msec
count_words_groupby 1.56 sec
count_words_groupby_sum 1.64 sec
count_normalize_words 622 msec
normalize_count_words 2.18 sec
The fastest methods:
normalize words - toascii_letter_lower_translate()
count words (presorted input) - the groupby()-based approach
count words - count_words_defaultdict()
it is faster to count the words first and then normalize them - count_normalize_words()
Latest version of the code: count-words-performance.py.
One source of inefficiency in the OP's code (which several answers fixed without commenting on) is the over-reliance on intermediate lists. There is no reason to create a temporary list of millions of words just to iterate over them, when a generator will do.
So instead of
cnt = Counter()
for word in [token.lower().strip(drop) for token in corpus]:
    cnt[word] += 1
it should be just
cnt = Counter(token.lower().strip(drop) for token in corpus)
And if you really want to sort the word counts alphabetically (what on earth for?), replace this
wordfreqs = sorted([(word, cnt[word]) for word in cnt])
with this:
wordfreqs = sorted(cnt.items()) # In Python 2: cnt.iteritems()
This should remove much of the inefficiency around the use of Counter (or any dictionary class used in a similar way).
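Putting both changes together, a minimal sketch (reusing the OP's corpus and drop names: an iterable of tokens and a string of characters to strip):

from collections import Counter

cnt = Counter(token.lower().strip(drop) for token in corpus)
wordfreqs = sorted(cnt.items())  # cnt.iteritems() in Python 2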