python counting elements in iterable with filter

Submitted by 浪尽此生 on 2019-12-20 06:23:26

Question


To count the elements in a list, you can use collections.Counter, but what if only some of the elements have to be counted?

I've set up this example (please note: NumPy is just for convenience. In general the list will contain arbitrary Python objects):

import numpy as np
from collections import Counter, defaultdict

num_samples = 10000000
num_unique = 1000
numbers = np.random.randint(0, num_unique, num_samples)

I would like to count how often a number occurs in this list, but I'm only interested in numbers <= 10.

This is the baseline to beat. The Counter just counts everything, which should produce some overhead.

%%time
counter = Counter(numbers)

CPU times: user 1.38 s, sys: 7.49 ms, total: 1.39 s
Wall time: 1.39 s

Filtering the iterable while counting doesn't seem to be possible with Counter itself. The following code is poor style, since it goes through the data twice (once to filter, once to count) instead of using a single loop:

%%time
filtered = [number for number in numbers if number <= 10]
counter = Counter(filtered)

CPU times: user 1.3 s, sys: 22.1 ms, total: 1.32 s
Wall time: 1.33 s

That speedup is basically negligible. Let's try a single loop:

%%time

counter = defaultdict(int)
for number in numbers:
    if number > 10:
        continue
    counter[number] += 1

CPU times: user 1.99 s, sys: 11.5 ms, total: 2 s
Wall time: 2.01 s

Well, my single loop is much worse. I assume that Counter profits from a C-based implementation?

The next thing I tried was switching my list comprehension for a generator expression. In principle this should mean that the data is only looped through once, as the generator is consumed by the Counter. The numbers are disappointing though: it is basically as fast as the vanilla Counter:

%%time
iterator = (number for number in numbers if number <= 10)
counter = Counter(iterator)

CPU times: user 1.38 s, sys: 8.51 ms, total: 1.39 s
Wall time: 1.39 s

At this point I took a step back and re-ran the numbers a few times. The three Counter versions (unfiltered, list comprehension, generator expression) are almost equal in speed. The defaultdict version is consistently much slower.
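Re-running the comparison by hand can be replaced with a small timeit harness. The following is only a sketch: it assumes a plain Python list of integers (scaled down to 100,000 items) instead of the NumPy array above, and the function names are made up for illustration. Absolute timings and even the ranking may differ from the figures quoted in this question.

```python
import random
import timeit
from collections import Counter, defaultdict

# Assumption: a plain list of ints stands in for the original data,
# scaled down so the benchmark finishes quickly.
random.seed(0)
numbers = [random.randrange(1000) for _ in range(100_000)]


def count_all():
    # Baseline: count every element, no filtering.
    return Counter(numbers)


def count_list_comprehension():
    # Two passes: build the filtered list, then count it.
    return Counter([n for n in numbers if n <= 10])


def count_generator():
    # Counter consumes the filtering generator lazily.
    return Counter(n for n in numbers if n <= 10)


def count_defaultdict():
    # Explicit single loop in pure Python.
    counter = defaultdict(int)
    for n in numbers:
        if n <= 10:
            counter[n] += 1
    return counter


variants = [count_all, count_list_comprehension, count_generator, count_defaultdict]
for func in variants:
    # The best of several runs is less sensitive to noise than one %%time cell.
    best = min(timeit.repeat(func, number=1, repeat=5))
    print(f"{func.__name__:<26} {best * 1000:8.2f} ms")
```

All three filtering variants should produce identical counts, so any difference between them is purely a matter of speed.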

How can I efficiently count elements in a Python list while filtering the elements at the same time?


Answer 1:


If this is about large NumPy arrays, you'd better take advantage of vectorized NumPy operations.

%%time
np.unique(numbers[numbers <= 10], return_counts=True)

Output:

Wall time: 31.2 ms

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
 array([10055, 10090,  9941, 10002,  9994,  9989, 10070,  9859, 10038,
        10028,  9965], dtype=int64))
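To make the one-liner easier to follow, here is the same approach on a tiny array, split into its two steps: a boolean mask selects the values of interest, and np.unique with return_counts=True tallies them. The sample data and the conversion to a dict are assumptions added for illustration.

```python
import numpy as np

# Small illustrative array (not the data from the question).
numbers = np.array([3, 12, 0, 3, 10, 11, 0, 3])

# Step 1: boolean mask, True where the element should be counted.
mask = numbers <= 10

# Step 2: count the occurrences of each remaining value.
values, counts = np.unique(numbers[mask], return_counts=True)

# Optional: convert to a plain dict of Python ints, similar to a Counter.
result = dict(zip(values.tolist(), counts.tolist()))
print(result)  # {0: 2, 3: 3, 10: 1}
```

Both steps run in compiled NumPy code, which avoids the per-element Python overhead of the Counter and defaultdict variants.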

For comparison, my own timing of your code gave slightly higher times than yours.
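For lists of arbitrary (hashable) Python objects, where NumPy does not apply, one option is to count everything first and filter the result afterwards. This is a sketch and not part of the answer above; count_matching and the sample data are hypothetical names chosen for illustration.

```python
from collections import Counter


def count_matching(items, predicate):
    """Count all items, then keep only the keys that satisfy predicate.

    The predicate is evaluated once per unique key instead of once per
    element, at the cost of temporarily storing counts for every key.
    """
    totals = Counter(items)
    return Counter({key: n for key, n in totals.items() if predicate(key)})


# Illustrative data: count only words with at most four letters.
words = ["apple", "kiwi", "apple", "fig", "banana", "fig", "fig"]
short_words = count_matching(words, lambda w: len(w) <= 4)
print(short_words)  # Counter({'fig': 3, 'kiwi': 1})
```

With 1000 unique values in 10,000,000 samples, as in the question, this would call the predicate 1000 times rather than 10,000,000 times. Whether that is a net win depends on how expensive the predicate is and how many distinct keys there are.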



Source: https://stackoverflow.com/questions/54803061/python-counting-elements-in-iterable-with-filter
