Numpy grouping using itertools.groupby performance

庸人自扰 2020-12-01 03:17

I have many large lists of integers (each with >35,000,000 entries) that will contain duplicates. I need to get a count for each integer in a list. The following code works, but seems slow.
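(The code itself is not preserved in this copy. A minimal sketch of the itertools.groupby approach the title refers to, which may differ from the original, would be:)

    from itertools import groupby

    def groupby_count(values):
        # groupby only merges *adjacent* equal elements, so sort first.
        return [(k, sum(1 for _ in g)) for k, g in groupby(sorted(values))]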

10 Answers
  •  [愿得一人]
    2020-12-01 03:35

    You could try the following (ab)use of scipy.sparse:

    import numpy as np
    from scipy import sparse

    def sparse_bincount(values):
        # Build a 1-row CSR matrix whose column indices are the values
        # themselves, each carrying a weight of 1.
        M = sparse.csr_matrix((np.ones(len(values)), values.astype(int), [0, len(values)]))
        # Merging duplicate entries sums their weights, i.e. counts each value.
        M.sum_duplicates()
        # Pack (value, count) pairs into a structured array:
        # 'f0' = unique values, 'f1' = occurrence counts.
        index = np.empty(len(M.indices), dtype='u4,u2')
        index['f0'] = M.indices
        index['f1'] = M.data
        return index


    This is slower than the winning answer, perhaps because scipy currently doesn't support unsigned integers as index types...
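    A quick usage sketch (the input array here is made up for illustration):

    values = np.array([3, 1, 3, 7, 1, 3])
    counts = sparse_bincount(values)
    # counts is a structured array: counts['f0'] holds the unique
    # values, counts['f1'] how often each appears.
    for value, count in counts:
        print(value, count)   # -> 1 2 / 3 3 / 7 1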
