Is there a way to make collections.Counter (Python2.7) aware that its input list is sorted?


The Problem

I've been playing around with different ways (in Python 2.7) to extract a list of (word, frequency) tuples from a corpus, or list of strings, and comparing their efficiency.

3 Answers

    To answer the question from the title: Counter, dict, defaultdict, and OrderedDict are hash-based types: to look up an item, they compute the hash of the key and use it to find the item. They even support keys that have no defined order, as long as those keys are hashable. In other words, Counter can't take advantage of pre-sorted input.
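
    For example (a tiny illustrative snippet, not part of the benchmark), Counter produces identical counts whether its input is sorted, shuffled, or unsorted, because each key is hashed independently:

    import random
    from collections import Counter
    
    words = ["cherry", "apple", "banana", "apple", "banana", "apple"]
    shuffled = list(words)
    random.shuffle(shuffled)
    
    # input order never matters to a hash-based container
    assert Counter(words) == Counter(sorted(words)) == Counter(shuffled)
    print Counter(words).most_common(1)  # [('apple', 3)]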

    The measurements show that sorting the input words takes longer than counting the words with a dictionary-based approach and sorting the result, combined (times in seconds):

    sorted                  3.19
    count_words_Counter     2.88
    count_words_defaultdict 2.45
    count_words_dict        2.58
    count_words_groupby     3.44
    count_words_groupby_sum 3.52
    

    Also, counting the words in already sorted input with groupby() takes only a fraction of the time it takes to sort the input in the first place, and it is faster than the dict-based approaches.

    from collections import Counter, defaultdict
    from itertools import groupby
    
    import nltk
    
    def count_words_Counter(words):
        return sorted(Counter(words).items())
    
    def count_words_groupby(words):
        # groupby() needs sorted input; len(list(gr)) materializes each group
        return [(w, len(list(gr))) for w, gr in groupby(sorted(words))]
    
    def count_words_groupby_sum(words):
        # like above, but counts each group without building an intermediate list
        return [(w, sum(1 for _ in gr)) for w, gr in groupby(sorted(words))]
    
    def count_words_defaultdict(words):
        d = defaultdict(int)
        for w in words:
            d[w] += 1
        return sorted(d.items())
    
    def count_words_dict(words):
        d = {}
        for w in words:
            try:
                d[w] += 1
            except KeyError:
                d[w] = 1
        return sorted(d.items())
    
    def _count_words_freqdist(words):
        # note: .items() returns words sorted by word frequency (decreasing order)
        #       (same as `Counter.most_common()`)
        #       so the code sorts twice (the second time in lexicographical order)
        return sorted(nltk.FreqDist(words).items())
    

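    A quick sanity check on a toy input (a hypothetical example, not part of the benchmark) shows that all the variants agree:

    words = "the quick brown fox jumps over the lazy dog the".split()
    result = count_words_dict(words)
    assert result == count_words_Counter(words) == count_words_defaultdict(words)
    assert result == count_words_groupby(words) == count_words_groupby_sum(words)
    print result[-1]  # ('the', 3)
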
    To reproduce the results, run the benchmark code (count-words-performance.py, linked at the end of this answer).

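    A minimal timing sketch (an illustrative stand-in for the linked script; it assumes the count_words_*() functions above are defined and the nltk gutenberg corpus has been downloaded):

    import timeit
    
    import nltk
    
    WORDS = list(nltk.corpus.gutenberg.words())  # see the note below on list()
    
    for func in (count_words_Counter, count_words_defaultdict, count_words_dict,
                 count_words_groupby, count_words_groupby_sum):
        # average of 3 runs, in seconds
        t = timeit.timeit(lambda: func(WORDS), number=3) / 3
        print "%-24s %.2f" % (func.__name__, t)
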
    Note: everything is about 3 times faster if nltk's lazy sequence of words is converted to a list first (WORDS = list(nltk.corpus.gutenberg.words())), but the relative performance stays the same (times in seconds):

    sorted                  1.22
    count_words_Counter     0.86
    count_words_defaultdict 0.48
    count_words_dict        0.54
    count_words_groupby     1.49
    count_words_groupby_sum 1.55
    

    The results are similar to those in Python - Is a dictionary slow to find frequency of each character?

    If you want to normalize the words (remove punctuation, make them lowercase, etc.), see the answers to What is the most efficient way in Python to convert a string to all lowercase stripping out all non-ascii alpha characters? Some examples:

    from string import ascii_letters, ascii_lowercase, maketrans
    
    def toascii_letter_lower_genexpr(s, _letter_set=ascii_lowercase):
        """
        >>> toascii_letter_lower_genexpr("ABC,-.!def")
        'abcdef'
        """
        return ''.join(c for c in s.lower() if c in _letter_set)
    
    def toascii_letter_lower_genexpr_set(s, _letter_set=set(ascii_lowercase)):
        # same as above, but with O(1) set membership tests
        return ''.join(c for c in s.lower() if c in _letter_set)
    
    def toascii_letter_lower_translate(s,
        table=maketrans(ascii_letters, ascii_lowercase * 2),
        deletechars=''.join(set(maketrans('', '')) - set(ascii_letters))):
        # map upper- and lowercase letters to lowercase; delete everything else
        return s.translate(table, deletechars)
    
    def toascii_letter_lower_filter(s, _letter_set=set(ascii_letters)):
        # in Python 2, filter() on a str returns a str
        return filter(_letter_set.__contains__, s).lower()
    
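    All four variants agree on plain ASCII input; a quick hypothetical check:

    s = "ABC,-.!def"
    assert (toascii_letter_lower_genexpr(s)
            == toascii_letter_lower_genexpr_set(s)
            == toascii_letter_lower_translate(s)
            == toascii_letter_lower_filter(s)
            == 'abcdef')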

    To count and normalize the words simultaneously:

    from itertools import imap
    
    def combine_counts(items):
        # merge the counts of words that normalize to the same string
        d = defaultdict(int)
        for word, count in items:
            d[word] += count
        return d.iteritems()
    
    def clean_words_in_items(clean_word, items):
        return ((clean_word(word), count) for word, count in items)
    
    def normalize_count_words(words):
        """Normalize then count words."""
        return count_words_defaultdict(imap(toascii_letter_lower_translate, words))
    
    def count_normalize_words(words):
        """Count then normalize words."""
        freqs = count_words_defaultdict(words)
        freqs = clean_words_in_items(toascii_letter_lower_translate, freqs)
        return sorted(combine_counts(freqs))
    
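    For example, on a toy input (hypothetical), both orders give the same merged counts, but count_normalize_words() calls the normalizer only once per distinct word, which is why it wins on a large corpus:

    words = ["Dog", "dog", "cat!", "cat"]
    print count_normalize_words(words)  # [('cat', 2), ('dog', 2)]
    print normalize_count_words(words)  # [('cat', 2), ('dog', 2)]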

    Results

    I've updated the benchmark to measure various combinations of the count_words*() and toascii*() functions (the timings for the 5x4 individual pairings are not shown):

    toascii_letter_lower_filter      0.954 usec small
    toascii_letter_lower_genexpr     2.44 usec small
    toascii_letter_lower_genexpr_set 2.19 usec small
    toascii_letter_lower_translate   0.633 usec small
    
    toascii_letter_lower_filter      124 usec random 2000
    toascii_letter_lower_genexpr     197 usec random 2000
    toascii_letter_lower_genexpr_set 121 usec random 2000
    toascii_letter_lower_translate   7.73 usec random 2000
    
    sorted                  1.28 sec 
    count_words_Counter     941 msec 
    count_words_defaultdict 501 msec 
    count_words_dict        571 msec 
    count_words_groupby     1.56 sec 
    count_words_groupby_sum 1.64 sec 
    
    count_normalize_words 622 msec 
    normalize_count_words 2.18 sec 
    

    The fastest methods:

    • normalize words - toascii_letter_lower_translate()

    • count words (presorted input) - groupby()-based approach

    • count words - count_words_defaultdict()

    • counting the words first and then normalizing them is faster than the reverse - count_normalize_words()

    Latest version of the code: count-words-performance.py.
