Get cumulative count per 2d array

backend · unresolved · 3 answers · 957 views
伪装坚强ぢ asked on 2021-01-18 03:41

I have general data, e.g. strings:

np.random.seed(343)

arr = np.sort(np.random.randint(5, size=(10, 10)), axis=1).astype(str)
print (arr)
[['0' '1' '1' ...


        
3 Answers
  •  谎友^ (OP)
     2021-01-18 04:00

    General Idea

    Consider the generic case where we perform this cumulative counting; if you think of the results as ranges, we could call them grouped ranges.

    Now, the idea starts off simple - compare slices offset by one element along the respective axis to find where consecutive elements differ, and pad with True at the start of each row/column (depending on the axis of counting).

    Then it gets more involved - set up an ID array such that a final cumsum over its flattened version yields the desired output in flattened order. The setup starts by initializing an array of ones with the same shape as the input. At each group start in the input (except the very first), offset the ID array by the previous group's length so that the running count resets there. The code below shows how we would do it for each row; a small 1D sketch of the offset trick follows right after the function -

    def grp_range_2drow(a, start=0):
        # Get grouped ranges along each row with resetting at places where
        # consecutive elements differ
        
        # Input(s) : a is 2D input array
        
        # Store shape info
        m,n = a.shape
        
        # Compare one-off slices for each row and pad with True's at starts
        # Those True's indicate start of each group
        p = np.ones((m,1),dtype=bool)
        a1 = np.concatenate((p, a[:,:-1] != a[:,1:]),axis=1)
        
        # Get indices of group starts in flattened version
        d = np.flatnonzero(a1)
    
        # Set up the ID array to be cumsum-ed for the desired output:
        # at each group start, assign an offset that cancels the previous
        # group's running count, so that a cumsum over the flattened
        # version gives the flattened desired output. Finally reshape to 2D.
        c = np.ones(m*n,dtype=int)
        c[d[1:]] = d[:-1]-d[1:]+1
        c[0] = start
        return c.cumsum().reshape(m,n)
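
    As promised, here is a minimal 1D sketch of that ID-array trick (my own illustration with made-up values, not part of the original answer). For the values [5, 5, 5, 9, 9] the group starts sit at flat indices [0, 3]; writing the negative offset at index 3 makes the cumulative sum reset there -

    import numpy as np

    v = np.array([5, 5, 5, 9, 9])
    d = np.flatnonzero(np.r_[True, v[:-1] != v[1:]])  # group-start indices -> [0, 3]
    c = np.ones(v.size, dtype=int)
    c[d[1:]] = d[:-1] - d[1:] + 1   # offset that cancels the previous group's running count
    c[0] = 1                        # desired start value
    print(c.cumsum())               # [1 2 3 1 2] : the count resets at each new group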
    

    We can extend this to handle rows and columns generically. For the columns case, we simply transpose, feed the result to the earlier row solution, and transpose back, like so -

    def grp_range_2d(a, start=0, axis=1):
        # Get grouped ranges along specified axis with resetting at places where
        # consecutive elements differ
        
        # Input(s) : a is 2D input array
    
        if axis not in [0, 1]:
            raise ValueError("Invalid axis")
    
        if axis==1:
            return grp_range_2drow(a, start=start)
        else:
            return grp_range_2drow(a.T, start=start).T
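
    As a quick check of the axis handling (a toy example of my own, not from the original answer), counting along rows versus columns of a tiny array gives -

    b = np.array([[1, 1, 2],
                  [2, 2, 2]])

    grp_range_2d(b, start=1, axis=1)   # along each row    -> [[1, 2, 1], [1, 2, 3]]
    grp_range_2d(b, start=1, axis=0)   # along each column -> [[1, 1, 1], [1, 1, 2]]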
    

    Sample run

    Let's consider a sample run that finds grouped ranges along each column, with each group starting at 1 -

    In [330]: np.random.seed(0)
    
    In [331]: a = np.random.randint(1,3,(10,10))
    
    In [333]: a
    Out[333]: 
    array([[1, 2, 2, 1, 2, 2, 2, 2, 2, 2],
           [2, 1, 1, 2, 1, 1, 1, 1, 1, 2],
           [1, 2, 2, 1, 1, 2, 2, 2, 2, 1],
           [2, 1, 2, 1, 2, 2, 1, 2, 2, 1],
           [1, 2, 1, 2, 2, 2, 2, 2, 1, 2],
           [1, 2, 2, 2, 2, 1, 2, 1, 1, 2],
           [2, 1, 2, 1, 2, 1, 1, 1, 1, 1],
           [2, 2, 1, 1, 1, 2, 2, 1, 2, 1],
           [1, 2, 1, 2, 2, 2, 2, 2, 2, 1],
           [2, 2, 1, 1, 2, 1, 1, 2, 2, 1]])
    
    In [334]: grp_range_2d(a, start=1, axis=0)
    Out[334]: 
    array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
           [1, 1, 1, 1, 1, 1, 1, 1, 1, 2],
           [1, 1, 1, 1, 2, 1, 1, 1, 1, 1],
           [1, 1, 2, 2, 1, 2, 1, 2, 2, 2],
           [1, 1, 1, 1, 2, 3, 1, 3, 1, 1],
           [2, 2, 1, 2, 3, 1, 2, 1, 2, 2],
           [1, 1, 2, 1, 4, 2, 1, 2, 3, 1],
           [2, 1, 1, 2, 1, 1, 1, 3, 1, 2],
           [1, 2, 2, 1, 1, 2, 2, 1, 2, 3],
           [1, 3, 3, 1, 2, 1, 1, 2, 3, 4]])
    

    Thus, to solve our case for dataframe input & output, it would be -

    out = grp_range_2d(df.values, start=1, axis=0)
    pd.DataFrame(out, columns=df.columns, index=df.index)
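
    For instance, on a small made-up DataFrame (my own example, assuming the functions above are defined; the question's actual df is not shown here), counting down each column gives -

    import pandas as pd

    df = pd.DataFrame({'A': ['x', 'x', 'y', 'y'],
                       'B': ['p', 'q', 'q', 'q']})

    out = grp_range_2d(df.values, start=1, axis=0)
    pd.DataFrame(out, columns=df.columns, index=df.index)
    #    A  B
    # 0  1  1
    # 1  2  1
    # 2  1  2
    # 3  2  3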
    
