Why is collections.Counter much slower than ''.count?

前端 未结 2 1106
你的背包
你的背包 2020-12-20 18:44

I have a simple task: To count how many times every letter occurs in a string. I\'ve used a Counter() for it, but on one forum I saw information that using

相关标签:
2条回答
  • 2020-12-20 19:28

    Counter() allows you to count any hashable objects, not just substrings. Both solutions are O(n)-time. Your measurements show that the overhead of iterating and hashing individual characters by Counter() is greater than running s.count() 4 times.

    Counter() can use C helper to count elements but it seems it doesn't special case strings and uses general algorithm applicable for any other iterable i.e., processing a single character involves multiple Python C API calls to advance the iterator, get previous value (a lookup in the hash table), increment counter, set new value (a lookup in the hash table):

        while (1) {
            key = PyIter_Next(it);
            if (key == NULL)
                break;
            oldval = PyObject_GetItem(mapping, key);
            if (oldval == NULL) {
                if (!PyErr_Occurred() || !PyErr_ExceptionMatches(PyExc_KeyError))
                    break;
                PyErr_Clear();
                Py_INCREF(one);
                newval = one;
            } else {
                newval = PyNumber_Add(oldval, one);
                Py_DECREF(oldval);
                if (newval == NULL)
                    break;
            }
            if (PyObject_SetItem(mapping, key, newval) == -1)
                break;
            Py_CLEAR(newval);
            Py_DECREF(key);
        }
    

    Compare it to FASTSEARCH() overhead for bytestrings:

        for (i = 0; i < n; i++)
            if (s[i] == p[0]) {
               count++;
               if (count == maxcount)
                  return maxcount;
            }
        return count;
    
    0 讨论(0)
  • 2020-12-20 19:32

    The Counter class inherits from dict, while string.count is the following C-implementation (CPython 3.3):

    /* stringlib: count implementation */
    
    #ifndef STRINGLIB_FASTSEARCH_H
    #error must include "stringlib/fastsearch.h" before including this module
    #endif
    
    
    Py_LOCAL_INLINE(Py_ssize_t)
    STRINGLIB(count)(const STRINGLIB_CHAR* str, Py_ssize_t str_len,
                    const STRINGLIB_CHAR* sub, Py_ssize_t sub_len,
                    Py_ssize_t maxcount)
    {
        Py_ssize_t count;
    
        if (str_len < 0)
            return 0; /* start > len(str) */
        if (sub_len == 0)
            return (str_len < maxcount) ? str_len + 1 : maxcount;
    
        count = FASTSEARCH(str, str_len, sub, sub_len, maxcount, FAST_COUNT);
    
        if (count < 0)
            return 0; /* no match */
    
        return count;
    }
    

    Guess, which one is faster? :)

    0 讨论(0)
提交回复
热议问题