I have many large lists of integers (>35,000,000 entries each) that will contain duplicates. I need to get a count for each distinct integer in a list. The following code works, but seems slow.
This is a fairly old thread, but I thought I'd mention that there's a small improvement to be made on the currently-accepted solution:
def group_by_edge():
    import numpy as np
    values = np.array(np.random.randint(0, 1 << 32, size=35000000), dtype='u4')
    values.sort()
    # Indices where the sorted value changes; each marks the start of a new group.
    edges = (values[1:] != values[:-1]).nonzero()[0] + 1
    idx = np.concatenate(([0], edges, [len(values)]))
    index = np.empty(len(idx) - 1, dtype='u4,u2')
    index['f0'] = values[idx[:-1]]  # each distinct value
    index['f1'] = np.diff(idx)      # its count
    return index
This tested about half a second faster on my machine; not a huge improvement, but worth something. Additionally, I think it's clearer what's happening here; the two-step diff approach is a bit opaque at first glance.
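A quick way to sanity-check the grouping logic is to run it on a small array and compare against `np.unique(..., return_counts=True)`, which computes the same value/count pairs directly. This is my own sketch (the function name and the tiny input are just for illustration, not from the answer above):

```python
import numpy as np

def group_counts(values):
    """Return a structured array of (value, count) pairs for a 1-D array."""
    values = np.sort(np.asarray(values))
    # Same edge-detection trick as above: starts of each run of equal values.
    edges = (values[1:] != values[:-1]).nonzero()[0] + 1
    idx = np.concatenate(([0], edges, [len(values)]))
    index = np.empty(len(idx) - 1, dtype='u4,u2')
    index['f0'] = values[idx[:-1]]
    index['f1'] = np.diff(idx)
    return index

data = np.array([3, 1, 3, 7, 1, 3], dtype='u4')
index = group_counts(data)
uniq, counts = np.unique(data, return_counts=True)
assert (index['f0'] == uniq).all()
assert (index['f1'] == counts).all()
```

Note that `np.unique` with `return_counts=True` is itself a one-line alternative for this whole problem, though the `u2` count field above assumes no value repeats more than 65,535 times.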