Numpy grouping using itertools.groupby performance

Asked by 庸人自扰, 2020-12-01 03:17

I have many large lists of integers (more than 35,000,000 elements each) that will contain duplicates. I need to get a count for each integer in such a list. The following code works, but seems slow. Can it be made faster?
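
The code itself did not survive here; judging from the title, it was presumably something along these lines (a reconstruction for context, not the asker's exact code; the array size and value range are placeholders):

    import itertools
    import numpy as np

    # Sort first: itertools.groupby only merges *adjacent* equal values.
    values = np.random.randint(0, 1 << 16, size=35_000_000)
    values.sort()  # in-place sort, so equal integers become adjacent
    counts = [(int(k), len(list(g))) for k, g in itertools.groupby(values)]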

10 Answers
  •  独厮守ぢ, 2020-12-01 03:46

    Sorting is Θ(N log N); I'd go for the amortized O(N) that Python's hashtable implementation provides. Use defaultdict(int) to keep a count for each integer and iterate over the array once:

    import collections

    counts = collections.defaultdict(int)
    for v in values:
        counts[v] += 1  # missing keys start at 0, so no key check is needed
    

    This is theoretically faster; unfortunately, I have no way to benchmark it right now. Allocating the additional memory might make it slower in practice than your in-place solution.
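
    As an aside (mine, not part of the original answer): collections.Counter wraps the same single counting pass in one call:

    from collections import Counter

    # Counter is a dict subclass built for tallying hashables; this
    # does the same single pass as the explicit loop above.
    counts = Counter(values)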

    Edit: If you need to save memory, try radix sort, which is much faster on integers than quicksort (which I believe is what numpy uses by default).
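
    To make the radix-sort idea concrete, here is a minimal pure-Python LSD (least-significant-digit) sketch for non-negative integers, my illustration rather than the answer's code; each pass is a stable bucket pass, so the whole sort is O(d*N) for d base-256 digits:

    def radix_sort(values, base=256):
        # LSD radix sort: repeatedly bucket by one digit, from least to
        # most significant. Each pass is stable, which keeps earlier
        # digit orderings intact and makes the result correct.
        values = list(values)
        if not values:
            return values
        max_val = max(values)
        place = 1
        while place <= max_val:
            buckets = [[] for _ in range(base)]
            for v in values:
                buckets[(v // place) % base].append(v)
            values = [v for bucket in buckets for v in bucket]
            place *= base
        return values

    A production version for 35,000,000 elements would do the bucket pass in C or with numpy primitives. Note also that the per-digit counting pass is essentially what np.bincount does; for non-negative integers with a bounded range, np.bincount(values) answers the counting question directly, with no sort at all.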
