Assume I have a NumPy array such as [1,2,3,4,5,6] and a second array [0,0,1,2,2,1]. I want to sum the items of the first array by group (the groups are given by the second array) and obtain n-groups
This is a vectorized method of doing this sum based on the implementation of numpy.unique. According to my timings it is up to 500 times faster than the loop method and up to 100 times faster than the histogram method.
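import numpy as np

def sum_by_group(values, groups):
    # Sort both arrays so that equal group labels are adjacent.
    order = np.argsort(groups)
    groups = groups[order]
    values = values[order]
    # Running total over the sorted values (in place; values is already a copy).
    values.cumsum(out=values)
    # Mark the last element of each run of equal labels.
    index = np.ones(len(groups), 'bool')
    index[:-1] = groups[1:] != groups[:-1]
    values = values[index]
    groups = groups[index]
    # Differences of the running totals at run ends are the per-group sums.
    values[1:] = values[1:] - values[:-1]
    return values, groups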
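As a quick sanity check, here is a minimal sketch that runs the same sort + cumulative-sum idea on the two arrays from the question and compares it with a plain loop. The function is repeated in compact form only so the snippet runs on its own; `loop_sum` is a hypothetical helper added for comparison and is not part of any answer in this thread.

```python
import numpy as np

def sum_by_group(values, groups):
    # Same sort + cumulative-sum idea as above, repeated so this runs alone.
    order = np.argsort(groups)
    groups = groups[order]
    values = values[order].cumsum()
    # True at the last element of each run of equal labels.
    last = np.ones(len(groups), dtype=bool)
    last[:-1] = groups[1:] != groups[:-1]
    totals = values[last]
    totals[1:] = totals[1:] - totals[:-1]
    return totals, groups[last]

def loop_sum(values, groups):
    # Hypothetical reference helper: one masked sum per distinct label.
    labels = np.unique(groups)
    return np.array([values[groups == g].sum() for g in labels]), labels

values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([0, 0, 1, 2, 2, 1])

sums, labels = sum_by_group(values, groups)
ref_sums, ref_labels = loop_sum(values, groups)
print(sums, labels)  # [3 9 9] [0 1 2]
```

Both inputs need to be NumPy arrays rather than plain lists, since the function relies on indexing with an index array and a boolean mask.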
I tried everyone's scripts, and my observations are:
Joe: Only works if you have few groups.
kevpie: Too slow because of the loops (this is not the Pythonic way).
Bi_Rico and Sven: Perform well, but only work for Int32 (it will fail if the sum goes over 2^32/2 = 2^31).
Alex: The fastest one, good for sums.
But if you want more flexibility and the option to group by other statistics, use SciPy:
import numpy as np
from scipy import ndimage

data = np.arange(10000000)
groups = np.arange(1000).repeat(10000)
# One sum per label 0..999
ndimage.sum(data, groups, range(1000))
This is convenient because the same interface gives you many grouped statistics (sum, mean, variance, ...).
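To illustrate the other statistics, here is a small sketch on the arrays from the question. It assumes `scipy.ndimage.mean` and `scipy.ndimage.variance` accept the same `(input, labels, index)` arguments as `ndimage.sum`; the variable names are only for illustration.

```python
import numpy as np
from scipy import ndimage

values = np.array([1, 2, 3, 4, 5, 6], dtype=float)
groups = np.array([0, 0, 1, 2, 2, 1])
labels = np.unique(groups)  # distinct group labels, sorted: [0 1 2]

# Each call returns one value per entry of `labels`.
sums = ndimage.sum(values, groups, labels)
means = ndimage.mean(values, groups, labels)
variances = ndimage.variance(values, groups, labels)

print(sums)   # [3. 9. 9.]
print(means)  # [1.5 4.5 4.5]
print(variances)
```

Passing `np.unique(groups)` as the index avoids hard-coding the number of groups, at the cost of one extra pass over the label array.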