Is there a way to make collections.Counter (Python2.7) aware that its input list is sorted?


The Problem

I've been playing around with different ways (in Python 2.7) to extract a list of (word, frequency) tuples from a corpus, or list of strings, and comparing their relative efficiency. The list is already sorted; is there a way to make collections.Counter take advantage of that?

3 Answers
  • 2020-12-16 21:43

    Given a sorted list of words as you mention, have you tried the traditional Pythonic approach of itertools.groupby?

    from itertools import groupby

    some_data = ['a', 'a', 'b', 'c', 'c', 'c']

    count = dict((k, sum(1 for i in v)) for k, v in groupby(some_data))
    # or, equivalently, as a dict comprehension:
    count = {k: sum(1 for i in v) for k, v in groupby(some_data)}
    # {'a': 2, 'c': 3, 'b': 1}
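
    Since the input is already sorted, groupby() also emits the groups in sorted order, so a sorted list of (word, frequency) tuples falls out directly, without any extra sort (a small sketch reusing the hypothetical some_data above):

    word_freqs = [(k, sum(1 for _ in v)) for k, v in groupby(some_data)]
    # [('a', 2), ('b', 1), ('c', 3)]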
    
  • 2020-12-16 21:57

    To answer the question from the title: Counter, dict, defaultdict, and OrderedDict are hash-based types: to look up an item they compute a hash of the key and use it to find the item. They even support keys that have no defined order, as long as they are hashable. In other words, Counter can't take advantage of pre-sorted input.
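
    A quick illustration with made-up data: because every lookup goes through the key's hash, the order in which items arrive changes neither the result nor the amount of work Counter does:

    from collections import Counter
    import random

    words = ['a', 'a', 'b', 'c', 'c', 'c']       # already sorted
    shuffled = words[:]
    random.shuffle(shuffled)                     # same words, arbitrary order

    assert Counter(words) == Counter(shuffled)   # identical counts either way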

    The measurements show that sorting the input words takes longer than counting them with a dictionary-based approach and sorting the result, combined (times in seconds):

    sorted                  3.19
    count_words_Counter     2.88
    count_words_defaultdict 2.45
    count_words_dict        2.58
    count_words_groupby     3.44
    count_words_groupby_sum 3.52
    

    Also, counting words in already-sorted input with groupby() takes only a fraction of the time it takes to sort that input in the first place, and it is faster than the dict-based approaches:

    from collections import Counter, defaultdict
    from itertools import groupby
    import nltk  # only needed for _count_words_freqdist()

    def count_words_Counter(words):
        return sorted(Counter(words).items())
    
    def count_words_groupby(words):
        return [(w, len(list(gr))) for w, gr in groupby(sorted(words))]
    
    def count_words_groupby_sum(words):
        return [(w, sum(1 for _ in gr)) for w, gr in groupby(sorted(words))]
    
    def count_words_defaultdict(words):
        d = defaultdict(int)
        for w in words:
            d[w] += 1
        return sorted(d.items())
    
    def count_words_dict(words):
        d = {}
        for w in words:
            try:
                d[w] += 1
            except KeyError:
                d[w] = 1
        return sorted(d.items())
    
    def _count_words_freqdist(words):
        # note: .items() returns words sorted by word frequency (decreasing order)
        #       (same as `Counter.most_common()`)
        #       so the code sorts twice (the second time in lexicographical order)
        return sorted(nltk.FreqDist(words).items())
    

    To reproduce the results, run this code.
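
    If the link is unavailable, a minimal timeit harness along these lines reproduces the comparison (it assumes the count_words_*() functions above are defined in the same script; the linked count-words-performance.py is the authoritative version):

    import timeit
    import nltk

    WORDS = nltk.corpus.gutenberg.words()  # nltk's lazy sequence of words

    print '%-24s %.2f' % ('sorted', timeit.timeit(
        'sorted(WORDS)', setup='from __main__ import WORDS', number=1))
    for name in ('count_words_Counter', 'count_words_defaultdict',
                 'count_words_dict', 'count_words_groupby',
                 'count_words_groupby_sum'):
        t = timeit.timeit('%s(WORDS)' % name, number=1,
                          setup='from __main__ import %s, WORDS' % name)
        print '%-24s %.2f' % (name, t)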

    Note: everything is about 3 times faster if nltk's lazy sequence of words is converted to a list (WORDS = list(nltk.corpus.gutenberg.words())), but the relative performance is the same:

    sorted                  1.22
    count_words_Counter     0.86
    count_words_defaultdict 0.48
    count_words_dict        0.54
    count_words_groupby     1.49
    count_words_groupby_sum 1.55
    

    The results are similar to Python - Is a dictionary slow to find frequency of each character?.

    If you want to normalize the words (remove punctuation, make them lowercase, etc.), see the answers to What is the most efficient way in Python to convert a string to all lowercase stripping out all non-ascii alpha characters?. Some examples:

    from string import ascii_letters, ascii_lowercase, maketrans

    def toascii_letter_lower_genexpr(s, _letter_set=ascii_lowercase):
        """
        >>> toascii_letter_lower_genexpr("ABC,-.!def")
        'abcdef'
        """
        return ''.join(c for c in s.lower() if c in _letter_set)
    
    def toascii_letter_lower_genexpr_set(s, _letter_set=set(ascii_lowercase)):
        return ''.join(c for c in s.lower() if c in _letter_set)
    
    def toascii_letter_lower_translate(s,
        table=maketrans(ascii_letters, ascii_lowercase * 2),
        deletechars=''.join(set(maketrans('', '')) - set(ascii_letters))):
        # Python 2 str.translate(): delete every byte that is not an ASCII letter,
        # then map uppercase letters to lowercase via the translation table
        return s.translate(table, deletechars)
    
    def toascii_letter_lower_filter(s, _letter_set=set(ascii_letters)):
        # Python 2: filter() on a str returns a str, so .lower() applies to the result
        return filter(_letter_set.__contains__, s).lower()
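
    All four variants produce the same output; a quick check using the doctest input from above:

    for f in (toascii_letter_lower_genexpr, toascii_letter_lower_genexpr_set,
              toascii_letter_lower_translate, toascii_letter_lower_filter):
        assert f("ABC,-.!def") == 'abcdef'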
    

    To count and normalize the words simultaneously:

    from itertools import imap  # Python 2; use the built-in map on Python 3

    def combine_counts(items):
        d = defaultdict(int)
        for word, count in items:
            d[word] += count
        return d.iteritems()
    
    def clean_words_in_items(clean_word, items):
        return ((clean_word(word), count) for word, count in items)
    
    def normalize_count_words(words):
        """Normalize then count words."""
        return count_words_defaultdict(imap(toascii_letter_lower_translate, words))
    
    def count_normalize_words(words):
        """Count then normalize words."""
        freqs = count_words_defaultdict(words)
        freqs = clean_words_in_items(toascii_letter_lower_translate, freqs)
        return sorted(combine_counts(freqs))
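
    For example, counting first and normalizing afterwards on a tiny made-up token list (the functions above must already be defined):

    words = ['The', 'the,', 'cat.', 'CAT', 'sat']
    print count_normalize_words(words)
    # [('cat', 2), ('sat', 1), ('the', 2)]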
    

    Results

    I've updated the benchmark to measure various combinations of count_words*() and toascii*() functions (5x4 pairs not shown):

    toascii_letter_lower_filter      0.954 usec small
    toascii_letter_lower_genexpr     2.44 usec small
    toascii_letter_lower_genexpr_set 2.19 usec small
    toascii_letter_lower_translate   0.633 usec small
    
    toascii_letter_lower_filter      124 usec random 2000
    toascii_letter_lower_genexpr     197 usec random 2000
    toascii_letter_lower_genexpr_set 121 usec random 2000
    toascii_letter_lower_translate   7.73 usec random 2000
    
    sorted                  1.28 sec 
    count_words_Counter     941 msec 
    count_words_defaultdict 501 msec 
    count_words_dict        571 msec 
    count_words_groupby     1.56 sec 
    count_words_groupby_sum 1.64 sec 
    
    count_normalize_words 622 msec 
    normalize_count_words 2.18 sec 
    

    The fastest methods:

    • normalize words - toascii_letter_lower_translate()

    • count words (presorted input) - groupby()-based approach

    • count words - count_words_defaultdict()

    • count and normalize combined - count_normalize_words() (it is faster to count the words first and then normalize them)

    Latest version of the code: count-words-performance.py.

  • 2020-12-16 21:59

    One source of inefficiency in the OP's code (which several answers fixed without commenting on it) is the over-reliance on intermediate lists. There is no reason to create a temporary list of millions of words just to iterate over it, when a generator will do.

    So instead of

    cnt = Counter()
    for word in [token.lower().strip(drop) for token in corpus]:
        cnt[word] += 1
    

    it should be just

    cnt = Counter(token.lower().strip(drop) for token in corpus)
    

    And if you really want to sort the word counts alphabetically (what on earth for?), replace this

    wordfreqs = sorted([(word, cnt[word]) for word in cnt])
    

    with this:

    wordfreqs = sorted(cnt.items())   # In Python 2: cnt.iteritems()
    

    This should remove much of the inefficiency around the use of Counter (or any dictionary class used in a similar way).
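
    Putting both points together (here drop and corpus stand in for the OP's punctuation string and token list):

    from collections import Counter

    cnt = Counter(token.lower().strip(drop) for token in corpus)  # no temporary list
    wordfreqs = sorted(cnt.iteritems())  # Python 2.7; use cnt.items() on Python 3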
