Numpy grouping using itertools.groupby performance


I have many large (>35,000,000 items) lists of integers that will contain duplicates. I need to get a count for each integer in a list. The code I am using works, but seems slow.
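
A sketch of such a groupby-based counter (per the question title) might look like the following; this is a hypothetical reconstruction, not the asker's exact code:

    import itertools
    import random

    def group_original(values):
        # sort first so equal integers are adjacent, then count each run
        return [(value, sum(1 for _ in run))
                for value, run in itertools.groupby(sorted(values))]

    values = [random.randint(0, 1 << 32) for _ in range(35000000)]
    counts = group_original(values)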

10 Answers

    This is a numpy solution:

    import numpy as np

    def group():
        values = np.array(np.random.randint(0, 1 << 32, size=35000000),
                          dtype='u4')

        # sort in place
        values.sort()

        # once the array is sorted, the number of occurrences of a unique
        # element is the index of its first occurrence when searching from
        # the right minus the index of its first occurrence when searching
        # from the left
        counts = (values.searchsorted(values, side='right') -
                  values.searchsorted(values, side='left'))

        # pair each value with its count in a structured array (the numpy
        # analogue of zip()-ing the two sequences); a count can reach the
        # length of the array, so use a 32-bit field rather than 16-bit
        index = np.empty(len(values), dtype='u4,u4')
        index['f0'] = values
        index['f1'] = counts
        return index

    if __name__ == '__main__':
        from timeit import Timer
        t = Timer("group()", "from __main__ import group")
        print(t.timeit(number=1))
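
    To see why the searchsorted difference gives per-element counts, here is a small, hand-checkable example on a made-up array:

    import numpy as np

    v = np.array([1, 1, 2, 5, 5, 5], dtype='u4')   # already sorted
    right = v.searchsorted(v, side='right')        # [2, 2, 3, 6, 6, 6]
    left = v.searchsorted(v, side='left')          # [0, 0, 2, 3, 3, 3]
    print(right - left)                            # [2, 2, 1, 3, 3, 3]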
    

    Runs in about 25 seconds on my machine, compared to about 96 seconds for your initial solution, which is a nice improvement.

    There might still be room for improvement; I don't use numpy that often.
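
    One candidate, assuming a reasonably recent numpy: np.unique can produce the counts directly with return_counts=True, returning one entry per distinct value rather than per element. A minimal sketch:

    import numpy as np

    def group_unique(values):
        # np.unique sorts a copy internally; with return_counts=True it
        # returns each distinct value alongside its number of occurrences
        return np.unique(values, return_counts=True)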

    Edit: added some comments to the code.
