Numpy grouping using itertools.groupby performance

庸人自扰 2020-12-01 03:17

I have many large lists of integers (each with >35,000,000 entries) that will contain duplicates. I need to get a count for each integer in a list. The following code works, but seems slow.
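(The code itself is not preserved in this copy. A minimal sketch of the itertools.groupby approach the title refers to, which may differ from the original, would be:)

    from itertools import groupby

    def groupby_count(values):
        # groupby only merges *adjacent* equal elements, so sort first.
        return [(k, sum(1 for _ in g)) for k, g in groupby(sorted(values))]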

10 Answers
  •  [愿得一人]
    2020-12-01 03:35

    You could try the following (ab)use of scipy.sparse:

    import numpy as np
    from scipy import sparse

    def sparse_bincount(values):
        # Build a 1-row CSR matrix whose column indices are the values
        # themselves, each carrying a weight of 1.
        M = sparse.csr_matrix((np.ones(len(values)), values.astype(int), [0, len(values)]))
        # Merging duplicate entries sums their weights, i.e. counts each value.
        M.sum_duplicates()
        # Pack (value, count) pairs into a structured array:
        # 'f0' = unique values, 'f1' = occurrence counts.
        index = np.empty(len(M.indices), dtype='u4,u2')
        index['f0'] = M.indices
        index['f1'] = M.data
        return index


    This is slower than the winning answer, perhaps because scipy currently doesn't support unsigned integers as index types...
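    A quick usage sketch (the input array here is made up for illustration):

    values = np.array([3, 1, 3, 7, 1, 3])
    counts = sparse_bincount(values)
    # counts is a structured array: counts['f0'] holds the unique
    # values, counts['f1'] how often each appears.
    for value, count in counts:
        print(value, count)   # -> 1 2 / 3 3 / 7 1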
