Fastest way to extract dictionary of sums in numpy in 1 I/O pass


Question


Let's say I have an array like:

arr = np.array([[1,20,5],
                [1,20,8],
                [3,10,4],
                [2,30,6],
                [3,10,5]])

and I would like to form a dictionary that maps each unique value in the first column to the sum of the third column over the rows where it appears, i.e. return {1: 13, 2: 6, 3: 9}. To make matters more challenging, there are 1 billion rows in my array and 100k unique values in the first column.

Approach 1: Naively, I can call np.unique(), then for each unique item combine np.where() and np.sum() inside a dictionary comprehension. This would be reasonably fast for a small number of unique elements, but at 100k unique elements I will incur a lot of wasted page fetches, making 100k I/O passes over the entire array.
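For concreteness, a minimal sketch of Approach 1 (using the arr defined above) might look like this:

import numpy as np

# One full boolean scan of arr per unique key: fine for a handful of keys,
# but 100k keys means 100k passes over a billion-row array.
sums = {k: np.sum(arr[np.where(arr[:, 0] == k)[0], 2])
        for k in np.unique(arr[:, 0])}
# -> {1: 13, 2: 6, 3: 9}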

Approach 2: I could instead make a single I/O pass over the array, since hashing column 1 at each row will probably be cheaper than the excessive page fetches, but then I lose the advantage of numpy's vectorized C inner loop.
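The pure-Python single-pass loop that Approach 2 is trying to improve on would be something like:

from collections import defaultdict

# One pass over the rows, but every iteration runs in the interpreter,
# so numpy's C inner loop is lost.
sums = defaultdict(int)
for key, _, val in arr:
    sums[int(key)] += int(val)
# dict(sums) -> {1: 13, 2: 6, 3: 9}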

Is there a fast way to implement Approach 2 without resorting to a pure Python loop?


Answer 1:


numpy approach:

import numpy as np

u = np.unique(arr[:, 0])                       # unique keys from column 0
# the broadcasted comparison materializes an (n_rows, n_keys) matrix,
# which is prohibitive at 1e9 rows x 100k keys
s = ((arr[:, [0]] == u) * arr[:, [2]]).sum(0)  # per-key sums of column 2

dict(np.stack([u, s]).T)

{1: 13, 2: 6, 3: 9}

pandas approach:

import pandas as pd
import numpy as np

# label the columns, group on A (column 0), sum C (column 2)
pd.DataFrame(arr, columns=list('ABC')).groupby('A').C.sum().to_dict()

{1: 13, 2: 6, 3: 9}




Answer 2:


Here's a NumPy-based approach using np.add.reduceat -

sidx = arr[:,0].argsort()                                   # order rows by key
idx = np.append(0,np.where(np.diff(arr[sidx,0])!=0)[0]+1)   # start offset of each group
keys = arr[sidx[idx],0]                                     # one key per group
vals = np.add.reduceat(arr[sidx,2],idx)                     # per-group sums of column 2

If you would like to get the keys and values in a 2-column array -

out = np.column_stack((keys, vals))

Sample run -

In [351]: arr
Out[351]: 
array([[ 1, 20,  5],
       [ 1, 20,  8],
       [ 3, 10,  4],
       [ 2, 30,  6],
       [ 3, 10,  5]])

In [352]: out
Out[352]: 
array([[ 1, 13],
       [ 2,  6],
       [ 3,  9]])



Answer 3:


This is a typical grouping problem, which the numpy_indexed package solves efficiently and elegantly (if I may say so myself; I am its author):

import numpy_indexed as npi
npi.group_by(arr[:, 0]).sum(arr[:, 2])

It's a much more lightweight solution than the pandas package, and I think the syntax is cleaner, as there is no need to create a special data structure just to perform this type of elementary operation. Performance should be identical to the solution posed by Divakar, since it follows the same steps, just with a nice and tested interface on top.
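As with the reduceat answer, if a plain dict is the goal, the pair returned by group_by(...).sum(...) can be zipped up; a small sketch, assuming it returns the unique keys alongside the per-group sums:

import numpy_indexed as npi

unique_keys, sums = npi.group_by(arr[:, 0]).sum(arr[:, 2])
result = dict(zip(unique_keys, sums))
# {1: 13, 2: 6, 3: 9}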




Answer 4:


The proper way of doing this with NumPy is to use np.bincount. If your unique first-column labels are already small contiguous integers, you could simply do:

# treats column-0 values as non-negative integer bins and sums the
# column-2 weights falling into each bin
cum_sums = np.bincount(arr[:, 0], weights=arr[:, 2])
cum_dict = {index: cum_sum for index, cum_sum in enumerate(cum_sums)
            if cum_sum != 0}

where the cum_sum != 0 check is an attempt to skip missing first-column labels, which may be grossly wrong if your third column includes negative numbers: a label that is present could sum to exactly zero and be dropped.

Alternatively, you can do things properly and call np.unique first:

# return_inverse maps each row's key to a compact 0..k-1 index,
# so bincount needs no assumptions about the label values
uniques, indices = np.unique(arr[:, 0], return_inverse=True)
cum_sums = np.bincount(indices, weights=arr[:, 2])
cum_dict = {index: cum_sum for index, cum_sum in zip(uniques, cum_sums)}
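One caveat worth noting: np.bincount with weights returns a float64 array, so the dictionary values come out as floats ({1: 13.0, ...}). If integer sums are wanted, a cast fixes that:

# weights force float64 accumulation; cast back for integer sums
cum_dict = {int(k): int(v) for k, v in zip(uniques, cum_sums)}
# {1: 13, 2: 6, 3: 9}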


Source: https://stackoverflow.com/questions/38496418/fastest-way-to-extract-dictionary-of-sums-in-numpy-in-1-i-o-pass
