Fastest way to extract dictionary of sums in numpy in 1 I/O pass


Question


Let's say I have an array like:

arr = np.array([[1,20,5],
                [1,20,8],
                [3,10,4],
                [2,30,6],
                [3,10,5]])

and I would like to form a dictionary that maps each unique value in the first column to the sum of the third column over the rows where it appears, i.e. return {1: 13, 2: 6, 3: 9}. To make matters more challenging, there are 1 billion rows in my array and 100k unique values in the first column.

Approach 1: Naively, I can call np.unique(), then for each unique item combine np.where() and np.sum() inside a dictionary comprehension. This would be reasonably fast for a small number of unique elements, but at 100k unique elements I will incur a lot of wasted page fetches, making 100k I/O passes over the entire array.
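For concreteness, a minimal sketch of Approach 1 (using the arr defined above) might look like this:

import numpy as np

# One full boolean scan of arr per unique key: fine for a handful of keys,
# but 100k keys means 100k passes over a billion-row array.
sums = {k: np.sum(arr[np.where(arr[:, 0] == k)[0], 2])
        for k in np.unique(arr[:, 0])}
# -> {1: 13, 2: 6, 3: 9}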

Approach 2: I could instead make a single I/O pass over the array, since hashing column 1 at each row will probably be cheaper than the excessive page fetches, but then I lose the advantage of numpy's vectorized C inner loop.
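The pure-Python single-pass loop that Approach 2 is trying to improve on would be something like:

from collections import defaultdict

# One pass over the rows, but every iteration runs in the interpreter,
# so numpy's C inner loop is lost.
sums = defaultdict(int)
for key, _, val in arr:
    sums[int(key)] += int(val)
# dict(sums) -> {1: 13, 2: 6, 3: 9}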

Is there a fast way to implement Approach 2 without resorting to a pure Python loop?


Answer 1:


numpy approach:

import numpy as np

u = np.unique(arr[:, 0])                       # unique keys from column 0
# the broadcasted comparison materializes an (n_rows, n_keys) matrix,
# which is prohibitive at 1e9 rows x 100k keys
s = ((arr[:, [0]] == u) * arr[:, [2]]).sum(0)  # per-key sums of column 2

dict(np.stack([u, s]).T)

{1: 13, 2: 6, 3: 9}

pandas approach:

import pandas as pd
import numpy as np

# label the columns, group on A (column 0), sum C (column 2)
pd.DataFrame(arr, columns=list('ABC')).groupby('A').C.sum().to_dict()

{1: 13, 2: 6, 3: 9}




Answer 2:


Here's a NumPy-based approach using np.add.reduceat -

sidx = arr[:,0].argsort()                                   # order rows by key
idx = np.append(0,np.where(np.diff(arr[sidx,0])!=0)[0]+1)   # start offset of each group
keys = arr[sidx[idx],0]                                     # one key per group
vals = np.add.reduceat(arr[sidx,2],idx)                     # per-group sums of column 2

If you would like to get the keys and values in a 2-column array -

out = np.column_stack((keys, vals))

Sample run -

In [351]: arr
Out[351]: 
array([[ 1, 20,  5],
       [ 1, 20,  8],
       [ 3, 10,  4],
       [ 2, 30,  6],
       [ 3, 10,  5]])

In [352]: out
Out[352]: 
array([[ 1, 13],
       [ 2,  6],
       [ 3,  9]])



Answer 3:


This is a typical grouping problem, which the numpy_indexed package solves efficiently and elegantly (if I may say so myself; I am its author):

import numpy_indexed as npi
npi.group_by(arr[:, 0]).sum(arr[:, 2])

It's a much more lightweight solution than the pandas package, and I think the syntax is cleaner, as there is no need to create a special data structure just to perform this type of elementary operation. Performance should be identical to the solution posed by Divakar, since it follows the same steps, just with a nice and tested interface on top.
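As with the reduceat answer, if a plain dict is the goal, the pair returned by group_by(...).sum(...) can be zipped up; a small sketch, assuming it returns the unique keys alongside the per-group sums:

import numpy_indexed as npi

unique_keys, sums = npi.group_by(arr[:, 0]).sum(arr[:, 2])
result = dict(zip(unique_keys, sums))
# {1: 13, 2: 6, 3: 9}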




Answer 4:


The proper way of doing this with NumPy is to use np.bincount. If your unique first-column labels are already small contiguous integers, you could simply do:

# treats column-0 values as non-negative integer bins and sums the
# column-2 weights falling into each bin
cum_sums = np.bincount(arr[:, 0], weights=arr[:, 2])
cum_dict = {index: cum_sum for index, cum_sum in enumerate(cum_sums)
            if cum_sum != 0}

where the cum_sum != 0 check is an attempt to skip missing first-column labels, which may be grossly wrong if your third column includes negative numbers: a label that is present could sum to exactly zero and be dropped.

Alternatively, you can do things properly and call np.unique first:

# return_inverse maps each row's key to a compact 0..k-1 index,
# so bincount needs no assumptions about the label values
uniques, indices = np.unique(arr[:, 0], return_inverse=True)
cum_sums = np.bincount(indices, weights=arr[:, 2])
cum_dict = {index: cum_sum for index, cum_sum in zip(uniques, cum_sums)}
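One caveat worth noting: np.bincount with weights returns a float64 array, so the dictionary values come out as floats ({1: 13.0, ...}). If integer sums are wanted, a cast fixes that:

# weights force float64 accumulation; cast back for integer sums
cum_dict = {int(k): int(v) for k, v in zip(uniques, cum_sums)}
# {1: 13, 2: 6, 3: 9}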


Source: https://stackoverflow.com/questions/38496418/fastest-way-to-extract-dictionary-of-sums-in-numpy-in-1-i-o-pass
