Numpy grouping using itertools.groupby performance


I have many large (>35,000,000 items) lists of integers that will contain duplicates. I need to get a count for each integer in a list. The code I am using works, but seems slow.
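
A sketch of such a groupby-based counter (per the question title) might look like the following; this is a hypothetical reconstruction, not the asker's exact code:

    import itertools
    import random

    def group_original(values):
        # sort first so equal integers are adjacent, then count each run
        return [(value, sum(1 for _ in run))
                for value, run in itertools.groupby(sorted(values))]

    values = [random.randint(0, 1 << 32) for _ in range(35000000)]
    counts = group_original(values)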

10 Answers

    This is a numpy solution:

    import numpy as np

    def group():
        values = np.array(np.random.randint(0, 1 << 32, size=35000000),
                          dtype='u4')

        # sort in place
        values.sort()

        # once the array is sorted, the number of occurrences of a unique
        # element is the index of its first occurrence when searching from
        # the right minus the index of its first occurrence when searching
        # from the left
        counts = (values.searchsorted(values, side='right') -
                  values.searchsorted(values, side='left'))

        # pair each value with its count in a structured array (the numpy
        # analogue of zip()-ing the two sequences); a count can reach the
        # length of the array, so use a 32-bit field rather than 16-bit
        index = np.empty(len(values), dtype='u4,u4')
        index['f0'] = values
        index['f1'] = counts
        return index

    if __name__ == '__main__':
        from timeit import Timer
        t = Timer("group()", "from __main__ import group")
        print(t.timeit(number=1))
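
    To see why the searchsorted difference gives per-element counts, here is a small, hand-checkable example on a made-up array:

    import numpy as np

    v = np.array([1, 1, 2, 5, 5, 5], dtype='u4')   # already sorted
    right = v.searchsorted(v, side='right')        # [2, 2, 3, 6, 6, 6]
    left = v.searchsorted(v, side='left')          # [0, 0, 2, 3, 3, 3]
    print(right - left)                            # [2, 2, 1, 3, 3, 3]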
    

    Runs in about 25 seconds on my machine, compared to about 96 seconds for your initial solution, which is a nice improvement.

    There might still be room for improvement; I don't use numpy that often.
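
    One candidate, assuming a reasonably recent numpy: np.unique can produce the counts directly with return_counts=True, returning one entry per distinct value rather than per element. A minimal sketch:

    import numpy as np

    def group_unique(values):
        # np.unique sorts a copy internally; with return_counts=True it
        # returns each distinct value alongside its number of occurrences
        return np.unique(values, return_counts=True)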

    Edit: added some comments to the code.
