Sum array by number in numpy

别跟我提以往 2020-11-30 08:51

Assume I have a numpy array like [1,2,3,4,5,6] and another array [0,0,1,2,2,1]. I want to sum the items in the first array by group (given by the second array) and obtain one sum per group, in group-number order (here, [3, 9, 9]).

8 answers
  • 2020-11-30 09:20

    If the groups are indexed by consecutive integers, you can abuse the numpy.histogram() function to get the result:

    import numpy

    data = numpy.arange(1, 7)
    groups = numpy.array([0,0,1,2,2,1])
    # One unit-width bin per group id; the weights turn counts into sums.
    sums = numpy.histogram(groups,
                           bins=numpy.arange(groups.min(), groups.max()+2),
                           weights=data)[0]
    # array([3, 9, 9])
    

    This will avoid any Python loops.
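
    If the group labels are not consecutive integers, one possible workaround (an assumption on my part, not something the answer above states) is to map them to 0..k-1 first with numpy.unique(..., return_inverse=True) and histogram the mapped indices instead. A minimal sketch, using made-up labels 10, 20 and 40:

```python
import numpy

data = numpy.arange(1, 7)
# Hypothetical, non-consecutive group labels (not from the original question).
groups = numpy.array([10, 10, 20, 40, 40, 20])

# labels holds the sorted unique labels; inverse maps each element
# to the position of its label in that array (0, 1, 2, ...).
labels, inverse = numpy.unique(groups, return_inverse=True)

# Same histogram trick as above, applied to the consecutive indices.
sums = numpy.histogram(inverse,
                       bins=numpy.arange(labels.size + 1),
                       weights=data)[0]

print(labels)  # [10 20 40]
print(sums)    # [3 9 9]
```

    Here sums[i] is the total for labels[i], so the two arrays can be zipped together if the original labels are needed.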

  • 2020-11-30 09:23

    I noticed the numpy tag, but in case you don't mind using pandas, this task becomes a one-liner:

    import pandas as pd
    import numpy as np
    
    data = np.arange(1, 7)
    groups = np.array([0, 0, 1, 2, 2, 1])
    
    df = pd.DataFrame({'data': data, 'groups': groups})
    

    So df then looks like this:

       data  groups
    0     1       0
    1     2       0
    2     3       1
    3     4       2
    4     5       2
    5     6       1
    

    Now you can use the functions groupby() and sum():

    print(df.groupby(['groups'], sort=False).sum())
    

    which gives you the desired output:

            data
    groups      
    0          3
    1          9
    2          9
    

    By default, groupby() sorts the group keys, so I pass sort=False, which might improve speed for huge dataframes. Note that with sort=False the groups appear in order of first occurrence rather than in sorted order.
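
    Since the question asks for an array, here is a minimal sketch of getting the sums back out of pandas as a plain numpy array. It keeps the default sorting so the result comes out in group-number order; selecting the 'data' column and calling to_numpy() are my additions, not part of the answer above.

```python
import numpy as np
import pandas as pd

data = np.arange(1, 7)
groups = np.array([0, 0, 1, 2, 2, 1])

df = pd.DataFrame({'data': data, 'groups': groups})

# Group by label (sorted by default), sum the 'data' column,
# then drop the pandas index to get a plain numpy array.
sums = df.groupby('groups')['data'].sum().to_numpy()

print(sums)  # [3 9 9]
```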

  • 2020-11-30 09:24

    You're all wrong! The best way to do it is:

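    import numpy as np

    a = [1,2,3,4,5,6]
    ix = [0,0,1,2,2,1]
    # One accumulator slot per group id (float64 zeros).
    accum = np.zeros(np.max(ix)+1)
    # Unbuffered in-place add: repeated indices are all accumulated.
    np.add.at(accum, ix, a)
    print(accum)
    # [3. 9. 9.]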
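
    It may help to see why np.add.at is needed here rather than the more obvious accum[ix] += a. With fancy indexing, the in-place add is buffered, so a group id that occurs several times receives only one of its contributions. A minimal sketch comparing the two (the variable names buffered and unbuffered are mine):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])
ix = np.array([0, 0, 1, 2, 2, 1])

# Buffered fancy-index update: each repeated index is written once,
# so the other contributions to that group are lost.
buffered = np.zeros(ix.max() + 1)
buffered[ix] += a

# Unbuffered ufunc.at: every occurrence of an index is accumulated.
unbuffered = np.zeros(ix.max() + 1)
np.add.at(unbuffered, ix, a)

print(buffered)    # not the group sums
print(unbuffered)  # [3. 9. 9.]
```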
    
  • 2020-11-30 09:24

    A pure Python implementation:

    from collections import defaultdict

    l = [1,2,3,4,5,6]
    g = [0,0,1,2,2,1]

    def group_sum(l, g):
        # Accumulate a running total per group label.
        groups = defaultdict(int)
        for li, gi in zip(l, g):
            groups[gi] += li
        # Return the totals ordered by group label.
        return [total for _, total in sorted(groups.items())]

    print(group_sum(l, g))

    [3, 9, 9]
    
  • 2020-11-30 09:33

    There's more than one way to do this, but here's one way:

    import numpy as np
    data = np.arange(1, 7)
    groups = np.array([0,0,1,2,2,1])
    
    unique_groups = np.unique(groups)
    sums = []
    for group in unique_groups:
        sums.append(data[groups == group].sum())
    

    You can vectorize things so that there's no for loop at all, but I'd recommend against it. It becomes unreadable, and will require a couple of 2D temporary arrays, which could require large amounts of memory if you have a lot of data.

    Edit: Here's one way you could entirely vectorize it. Keep in mind that this may (and likely will) be slower than the version above. (There may be a better way to vectorize this, but it's late and I'm tired, so this is just the first thing to pop into my head...)

    However, this is a bad example... You're really better off (both in terms of speed and readability) with the loop above...

    import numpy as np
    data = np.arange(1, 7)
    groups = np.array([0,0,1,2,2,1])
    
    unique_groups = np.unique(groups)
    
    # Forgive the bad naming here...
    # I can't think of more descriptive variable names at the moment...
    x, y = np.meshgrid(groups, unique_groups)
    data_stack = np.tile(data, (unique_groups.size, 1))
    
    data_in_group = np.zeros_like(data_stack)
    data_in_group[x==y] = data_stack[x==y]
    
    sums = data_in_group.sum(axis=1)
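
    For comparison, the same idea can be written a little more compactly with broadcasting instead of meshgrid and tile. This is only a sketch of an alternative, not the answer's own code, and it still builds a temporary array of shape (number of groups, number of items), so the memory caveat above applies equally:

```python
import numpy as np

data = np.arange(1, 7)
groups = np.array([0, 0, 1, 2, 2, 1])
unique_groups = np.unique(groups)

# mask[i, j] is True where element j belongs to unique_groups[i];
# broadcasting a column against a row builds the 2D array directly.
mask = groups == unique_groups[:, None]

# Zero out the elements outside each group, then sum along each row.
sums = np.where(mask, data, 0).sum(axis=1)

print(sums)  # [3 9 9]
```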
    
  • 2020-11-30 09:36

    The numpy function bincount was made exactly for this purpose, and I expect it to be much faster than the other methods here for all sizes of inputs. It does require the ids to be non-negative integers:

    import numpy as np

    data = [1,2,3,4,5,6]
    ids  = [0,0,1,2,2,1]

    np.bincount(ids, weights=data) #returns [3,9,9] as a float64 array
    

    The i-th element of the output is the sum of all the data elements corresponding to "id" i.
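
    Two practical details, sketched below under my own assumptions about what a caller might want: the weighted result is float64, so it can be cast back if integer sums are needed, and the minlength parameter pads the output when some trailing group ids never occur (the value 5 is just an illustrative choice).

```python
import numpy as np

data = [1, 2, 3, 4, 5, 6]
ids = [0, 0, 1, 2, 2, 1]

# Weighted bincount always returns float64.
sums = np.bincount(ids, weights=data)

# Cast back to integers if the inputs were integral.
int_sums = sums.astype(int)

# Ask for at least 5 output slots; ids 3 and 4 never occur, so they sum to 0.
padded = np.bincount(ids, weights=data, minlength=5)

print(sums)      # [3. 9. 9.]
print(int_sums)  # [3 9 9]
print(padded)    # [3. 9. 9. 0. 0.]
```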

    Hope that helps.
