Numpy grouping using itertools.groupby performance

Asked by 庸人自扰, 2020-12-01 03:17

I have many large lists of integers (more than 35,000,000 elements each) that will contain duplicates. I need to get a count for each integer in such a list. The following code works, but seems slow. Can it be made faster?
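
The code itself did not survive here; judging from the title, it was presumably something along these lines (a reconstruction for context, not the asker's exact code; the array size and value range are placeholders):

    import itertools
    import numpy as np

    # Sort first: itertools.groupby only merges *adjacent* equal values.
    values = np.random.randint(0, 1 << 16, size=35_000_000)
    values.sort()  # in-place sort, so equal integers become adjacent
    counts = [(int(k), len(list(g))) for k, g in itertools.groupby(values)]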

10 Answers
  •  独厮守ぢ, 2020-12-01 03:46

    Sorting is Θ(N log N); I'd go for the amortized O(N) that Python's hashtable implementation provides. Use defaultdict(int) to keep a count for each integer and iterate over the array once:

    import collections

    counts = collections.defaultdict(int)
    for v in values:
        counts[v] += 1  # missing keys start at 0, so no key check is needed
    

    This is theoretically faster; unfortunately, I have no way to benchmark it right now. Allocating the additional memory might make it slower in practice than your in-place solution.
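
    As an aside (mine, not part of the original answer): collections.Counter wraps the same single counting pass in one call:

    from collections import Counter

    # Counter is a dict subclass built for tallying hashables; this
    # does the same single pass as the explicit loop above.
    counts = Counter(values)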

    Edit: If you need to save memory, try radix sort, which is much faster on integers than quicksort (which I believe is what numpy uses by default).
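
    To make the radix-sort idea concrete, here is a minimal pure-Python LSD (least-significant-digit) sketch for non-negative integers, my illustration rather than the answer's code; each pass is a stable bucket pass, so the whole sort is O(d*N) for d base-256 digits:

    def radix_sort(values, base=256):
        # LSD radix sort: repeatedly bucket by one digit, from least to
        # most significant. Each pass is stable, which keeps earlier
        # digit orderings intact and makes the result correct.
        values = list(values)
        if not values:
            return values
        max_val = max(values)
        place = 1
        while place <= max_val:
            buckets = [[] for _ in range(base)]
            for v in values:
                buckets[(v // place) % base].append(v)
            values = [v for bucket in buckets for v in bucket]
            place *= base
        return values

    A production version for 35,000,000 elements would do the bucket pass in C or with numpy primitives. Note also that the per-digit counting pass is essentially what np.bincount does; for non-negative integers with a bounded range, np.bincount(values) answers the counting question directly, with no sort at all.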
