I have many large (>35,000,000-element) lists of integers that will contain duplicates. I need to get a count for each integer in a list. The following code works, but seems slow.
I guess the most obvious and still unmentioned approach is to simply use collections.Counter. Instead of building a huge number of temporary lists with groupby, it just counts the integers up as it goes. It's a one-liner and gives a 2-fold speedup, but it is still slower than the pure numpy solutions.
def group():
    import sys
    import numpy as np
    from collections import Counter
    values = np.array(np.random.randint(0, sys.maxint, size=35000000), dtype='u4')
    c = Counter(values)

if __name__ == '__main__':
    from timeit import Timer
    t = Timer("group()", "from __main__ import group")
    print t.timeit(number=1)
On my machine this brings the runtime from 136 s down to 62 s, compared to the initial solution.
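For comparison, a typical pure-numpy way to get the same per-integer counts is a sort-based approach via np.unique with return_counts (available since numpy 1.9). This is a sketch of that alternative, not the specific solution benchmarked above; the helper name group_numpy is mine:

    import numpy as np

    def group_numpy(values):
        # np.unique sorts the array and returns each distinct value
        # together with how many times it occurs.
        uniq, counts = np.unique(values, return_counts=True)
        # Build a value -> count mapping, like Counter would.
        return dict(zip(uniq.tolist(), counts.tolist()))

    # Example: 5 occurs twice, 1 once.
    print group_numpy(np.array([5, 5, 1], dtype='u4'))

Because the counting happens in sorted C loops rather than per-element Python hashing, this kind of approach is what typically beats Counter on arrays of this size.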