Numpy grouping using itertools.groupby performance

Asked by 庸人自扰 · 2020-12-01 03:17

I have many large (>35,000,000-element) lists of integers that will contain duplicates. I need to get a count for each integer in a list. The following code works, but seems slow. Can I improve the performance, e.g. using numpy?
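(The asker's code is cut off above; the itertools.groupby approach the title refers to would look something like the following sketch. The names and sizes here are illustrative, not the asker's exact code.)

    import numpy as np
    from itertools import groupby

    # Sketch of the slow groupby baseline: groupby only merges *adjacent*
    # equal elements, so the array must be sorted first. Materializing
    # each group as a list is what makes this approach expensive.
    values = np.random.randint(0, 2**32, size=35_000_000, dtype=np.uint32)
    values.sort()
    counts = {int(k): len(list(g)) for k, g in groupby(values)}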

10 Answers
  •  孤独总比滥情好 · 2020-12-01 03:35

    I guess the most obvious, and still unmentioned, approach is to simply use collections.Counter. Instead of building a huge number of temporary lists the way groupby does, it just increments a count per integer. It's a one-liner and gives a 2-fold speedup, but is still slower than the pure numpy solutions.

    def group():
        import sys
        import numpy as np
        from collections import Counter
        # 35 million random integers, truncated to unsigned 32-bit, with duplicates
        values = np.array(np.random.randint(0, sys.maxsize, size=35000000), dtype='u4')
        # Counter tallies each integer directly, without the per-group
        # temporary lists that the groupby approach builds
        c = Counter(values)

    if __name__ == '__main__':
        from timeit import Timer
        t = Timer("group()", "from __main__ import group")
        print(t.timeit(number=1))


    On my machine, this brings the time down from 136 s to 62 s, compared to the initial solution.
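
    For comparison, here is a sketch of the kind of pure-numpy solution mentioned above (using np.unique with return_counts; the other answers may use different primitives):

    import numpy as np

    def group_numpy(values):
        # Single vectorized, sort-based pass: returns each distinct value
        # alongside the number of times it occurs.
        unique_vals, counts = np.unique(values, return_counts=True)
        return unique_vals, counts

    # Note: for non-negative integers with a small maximum value,
    # np.bincount(values) is typically faster still, but with u4 values
    # spanning the full 32-bit range it would allocate a 2**32-entry array.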
