I have many large (>35,000,000) lists of integers that will contain duplicates. I need to get a count for each integer in a list. The following code works, but seems slow. C
You could try the following (ab)use of scipy.sparse
:
from scipy import sparse
def sparse_bincount(values):
M = sparse.csr_matrix((np.ones(len(values)), values.astype(int), [0, len(values)]))
M.sum_duplicates()
index = np.empty(len(M.indices),dtype='u4,u2')
index['f0'] = M.indices
index['f1']= M.data
return index
This is slower than the winning answer, perhaps because scipy
currently doesn't support unsigned as indices types...