I have many large lists of integers (more than 35,000,000 elements each) that contain duplicates. I need a count for each integer in a list. The following code works, but seems slow.
This is a numpy solution:
def group():
    import numpy as np
    values = np.array(np.random.randint(0, 1 << 32, size=35000000), dtype='u4')
    # sort in place
    values.sort()
    # Once the array is sorted, the number of occurrences of an element is the
    # insertion index found when searching from the right minus the insertion
    # index found when searching from the left.
    counts = (values.searchsorted(values, side='right')
              - values.searchsorted(values, side='left'))
    # pair each value with its count in a structured array, much like
    # Python's zip() would (fields 'f0' and 'f1' are numpy's default names)
    index = np.empty(len(values), dtype='u4,u4')
    index['f0'] = values
    index['f1'] = counts
    return index

if __name__ == '__main__':
    from timeit import Timer
    t = Timer("group()", "from __main__ import group")
    print(t.timeit(number=1))
This runs in about 25 seconds on my machine, compared to about 96 seconds for your initial solution, which is a nice improvement.
There might still be room for improvement; I don't use numpy that often.
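One possible improvement, as a minimal sketch: numpy 1.9 and later let np.unique return the counts directly through return_counts=True, which also gives one entry per distinct integer instead of one per element. The helper name count_unique and the small sample array are made up for illustration, and I have not timed this against the code above.

```python
import numpy as np

def count_unique(values):
    # Hypothetical helper: returns (distinct values, count of each).
    # np.unique sorts internally; return_counts needs numpy >= 1.9.
    uniques, counts = np.unique(values, return_counts=True)
    return uniques, counts

# Tiny made-up sample standing in for one of the large lists.
sample = np.array([7, 3, 7, 1, 3, 7], dtype='u4')
uniques, counts = count_unique(sample)
for value, count in zip(uniques, counts):
    print("%d occurs %d time(s)" % (value, count))
```

For the real data you would pass the 35,000,000-element array in place of sample.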
Edit: added some comments in code.
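To make the counting trick from the comments concrete, here is a small sketch on a six-element array. The sample data and the final de-duplication step are my own illustration, not part of the timed code.

```python
import numpy as np

values = np.array([5, 2, 5, 9, 2, 5], dtype='u4')
values.sort()  # sorted in place: 2 2 5 5 5 9

# For each element: index of its first occurrence in the sorted array ...
left = values.searchsorted(values, side='left')
# ... and the index one past its last occurrence.
right = values.searchsorted(values, side='right')

# The difference is how often each element occurs.
counts = right - left

# Collapse to one (value, count) pair per distinct integer.
# A Python set is fine for this tiny sample but would be slow on huge arrays.
pairs = sorted(set(zip(values.tolist(), counts.tolist())))
print(pairs)
```

Every duplicate gets the same count, so the per-element result repeats itself; that is why the last step keeps only one pair per distinct value.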