Numpy grouping using itertools.groupby performance

庸人自扰 · 2020-12-01 03:17

I have many large lists of integers (more than 35,000,000 elements each) that will contain duplicates. I need to get a count for each integer in a list. The following code works, but seems slow.
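(The asker's snippet is not reproduced in this copy. A minimal sketch of the sort-then-itertools.groupby counter the title refers to might look like the following; the function name and variable names are illustrative assumptions, not the asker's exact code.)

    import itertools
    import numpy as np

    def count_with_groupby(values):
        # groupby only merges *adjacent* equal items, so sort first.
        values = np.sort(values)
        # One (value, count) pair per run of equal values.
        return [(int(k), sum(1 for _ in g)) for k, g in itertools.groupby(values)]

    counts = count_with_groupby(
        np.random.randint(0, 1 << 32, size=35000000, dtype='u4'))

The Python-level loop over 35,000,000 elements is what makes this slow; the answers below move the grouping into vectorized NumPy operations.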

10 Answers
  •  孤独总比滥情好 · 2020-12-01 03:30

    This is a fairly old thread, but I thought I'd mention that there's a small improvement to be made on the currently-accepted solution:

    def group_by_edge():
        import numpy as np
        values = np.array(np.random.randint(0, 1 << 32, size=35000000), dtype='u4')
        values.sort()
        # Start of each run of equal values: one past every position where the value changes.
        edges = (values[1:] != values[:-1]).nonzero()[0] + 1
        # Bracket the run starts with 0 and len(values) so np.diff yields run lengths.
        idx = np.concatenate(([0], edges, [len(values)]))
        index = np.empty(len(idx) - 1, dtype='u4, u2')
        index['f0'] = values[idx[:-1]]  # one representative value per run
        index['f1'] = np.diff(idx)      # its count ('u2' suffices: 35M random 32-bit ints rarely repeat)
    

    This tested as about half a second faster on my machine; not a huge improvement, but worth something. I also think it's clearer what's happening here: the two-step diff approach of the accepted answer is a bit opaque at first glance.
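    To make the boundary logic concrete, here is a small worked check (my own example, not from the original answer):

    import numpy as np

    values = np.array([3, 1, 2, 1, 2, 2], dtype='u4')
    values.sort()                                           # -> [1, 1, 2, 2, 2, 3]
    edges = (values[1:] != values[:-1]).nonzero()[0] + 1    # -> [2, 5]
    idx = np.concatenate(([0], edges, [len(values)]))       # -> [0, 2, 5, 6]
    print(values[idx[:-1]])   # [1 2 3]  the distinct values
    print(np.diff(idx))       # [2 3 1]  their counts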
