Number of unique elements per row in a NumPy array

执念已碎 2021-01-15 00:41

For example, for

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])

I want to get

[2, 2, 3]

Is there a way to do this?

4 Answers
  •  长情又很酷
    2021-01-15 01:07

    Approach #1

    One vectorized approach with sorting -

    In [8]: b = np.sort(a,axis=1)
    
    In [9]: (b[:,1:] != b[:,:-1]).sum(axis=1)+1
    Out[9]: array([2, 2, 3])
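
To see why the sorting trick works, here is a minimal step-by-step sketch on the array from the question. The intermediate names `b`, `changes` and `counts` are only illustrative. It assumes every row has at least one element; with zero columns the `+1` would wrongly report one unique value.

```python
import numpy as np

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])

# Sort each row so that equal values end up next to each other.
b = np.sort(a, axis=1)
print(b)
# [[0 0 1]
#  [0 0 1]
#  [2 3 4]]

# True wherever an element differs from its left neighbour,
# i.e. wherever a new distinct value starts.
changes = b[:, 1:] != b[:, :-1]

# The first element of a row is always a new value, hence the +1.
counts = changes.sum(axis=1) + 1
print(counts)  # [2 2 3]
```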
    

    Approach #2

    Another method, for non-negative ints that aren't very large, is to add a per-row offset so that the values of each row land in their own disjoint range, then do a binned summation and count the number of non-zero bins per row -

    n = a.max()+1
    a_off = a+(np.arange(a.shape[0])[:,None])*n
    M = a.shape[0]*n
    out = (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)
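
The offsetting step is easier to follow with the intermediates printed. The sketch below walks through it on the array from the question. The helper `bincount_shifted` at the end is a hypothetical variant, not part of the original answer: it subtracts the minimum first so that negative ints are accepted too, since `np.bincount` rejects negative input.

```python
import numpy as np

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])

n = a.max() + 1                            # values lie in 0..n-1, here n = 5
row_ids = np.arange(a.shape[0])[:, None]   # column vector [[0], [1], [2]]
a_off = a + row_ids * n                    # row i is shifted into [i*n, (i+1)*n)
print(a_off)
# [[ 1  0  0]
#  [ 6  5  5]
#  [12 13 14]]

# One histogram over all shifted values, then one row of n bins per input row.
hist = np.bincount(a_off.ravel(), minlength=a.shape[0] * n).reshape(-1, n)
print(hist)
# [[2 1 0 0 0]
#  [2 1 0 0 0]
#  [0 0 1 1 1]]

counts = (hist != 0).sum(axis=1)
print(counts)  # [2 2 3]


def bincount_shifted(a):
    """Hypothetical variant: shift by the minimum so negative ints also work."""
    a = a - a.min()
    n = a.max() + 1
    a_off = a + np.arange(a.shape[0])[:, None] * n
    hist = np.bincount(a_off.ravel(), minlength=a.shape[0] * n)
    return (hist.reshape(-1, n) != 0).sum(axis=1)


print(bincount_shifted(np.array([[-1, -1, 2], [0, 5, 5]])))  # [2 2]
```

The histogram holds `a.shape[0] * n` bins, so memory grows with the value range. That is why this approach only suits ints that aren't very large.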
    

    Runtime test

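    Approaches as funcs -

    import numpy as np
    import pandas as pd
    from toolz import compose  # assumption: compose comes from toolz

    def sorting(a):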
        b = np.sort(a,axis=1)
        return (b[:,1:] != b[:,:-1]).sum(axis=1)+1
    
    def bincount(a):
        n = a.max()+1
        a_off = a+(np.arange(a.shape[0])[:,None])*n
        M = a.shape[0]*n
        return (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)
    
    # From @wim's post   
    def pandas(a):
        df = pd.DataFrame(a.T)
        return df.nunique()
    
    # @jp_data_analysis's soln
    def numpy_apply(a):
        return np.apply_along_axis(compose(len, np.unique), 1, a) 
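
Before timing anything, it may be worth checking that the two NumPy approaches agree with a slow but obvious reference. This is a minimal sketch using NumPy only; the `reference` function and the test array size are illustrative choices, not part of the original benchmark.

```python
import numpy as np

def sorting(a):
    b = np.sort(a, axis=1)
    return (b[:, 1:] != b[:, :-1]).sum(axis=1) + 1

def bincount(a):
    n = a.max() + 1
    a_off = a + np.arange(a.shape[0])[:, None] * n
    M = a.shape[0] * n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1, n) != 0).sum(1)

def reference(a):
    # Plain Python loop over rows: slow, but hard to get wrong.
    return np.array([len(np.unique(row)) for row in a])

rng = np.random.default_rng(0)
a = rng.integers(0, 5, size=(1000, 10))

expected = reference(a)
print(np.array_equal(sorting(a), expected))
print(np.array_equal(bincount(a), expected))
```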
    

    Case #1 : Square shaped one

    In [164]: np.random.seed(0)
    
    In [165]: a = np.random.randint(0,5,(10000,10000))
    
    In [166]: %timeit numpy_apply(a)
         ...: %timeit sorting(a)
         ...: %timeit bincount(a)
         ...: %timeit pandas(a)
    1 loop, best of 3: 1.82 s per loop
    1 loop, best of 3: 1.93 s per loop
    1 loop, best of 3: 354 ms per loop
    1 loop, best of 3: 879 ms per loop
    

    Case #2 : Large number of rows

    In [167]: np.random.seed(0)
    
    In [168]: a = np.random.randint(0,5,(1000000,10))
    
    In [169]: %timeit numpy_apply(a)
         ...: %timeit sorting(a)
         ...: %timeit bincount(a)
         ...: %timeit pandas(a)
    1 loop, best of 3: 8.42 s per loop
    10 loops, best of 3: 153 ms per loop
    10 loops, best of 3: 66.8 ms per loop
    1 loop, best of 3: 53.6 s per loop
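
The timings above come from IPython's `%timeit`. To repeat the comparison from a plain script, the standard-library `timeit` module can be used instead. This is a rough sketch for the two NumPy approaches only; the array size and repeat counts are arbitrary choices made so that it finishes quickly, so the numbers will not match the ones above.

```python
import timeit
import numpy as np

def sorting(a):
    b = np.sort(a, axis=1)
    return (b[:, 1:] != b[:, :-1]).sum(axis=1) + 1

def bincount(a):
    n = a.max() + 1
    a_off = a + np.arange(a.shape[0])[:, None] * n
    M = a.shape[0] * n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1, n) != 0).sum(1)

rng = np.random.default_rng(0)
a = rng.integers(0, 5, size=(100_000, 10))

results = {}
for func in (sorting, bincount):
    # Best of 3 repeats, 5 calls per repeat, reported as time per call.
    best = min(timeit.repeat(lambda: func(a), number=5, repeat=3)) / 5
    results[func.__name__] = best
    print(f"{func.__name__}: {best * 1e3:.1f} ms per call")
```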
    

    Extending to number of unique elements per column

    To extend, we just need to do the slicing and ufunc operations along the other axis for the two proposed approaches, like so -

    def nunique_percol_sort(a):
        b = np.sort(a,axis=0)
        return (b[1:] != b[:-1]).sum(axis=0)+1
    
    def nunique_percol_bincount(a):
        n = a.max()+1
        a_off = a+(np.arange(a.shape[1]))*n
        M = a.shape[1]*n
        return (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)
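
A quick way to convince yourself that the per-column versions are right is to run them on a small non-square array. The 4x3 array below is made up for illustration.

```python
import numpy as np

def nunique_percol_sort(a):
    b = np.sort(a, axis=0)
    return (b[1:] != b[:-1]).sum(axis=0) + 1

def nunique_percol_bincount(a):
    n = a.max() + 1
    a_off = a + np.arange(a.shape[1]) * n   # column j is shifted into [j*n, (j+1)*n)
    M = a.shape[1] * n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1, n) != 0).sum(1)

a = np.array([[1, 0, 2],
              [1, 0, 1],
              [0, 0, 0],
              [1, 2, 1]])

print(nunique_percol_sort(a))      # [2 2 3]
print(nunique_percol_bincount(a))  # [2 2 3]
```

Here the offset is a plain 1D `np.arange(a.shape[1])`, which broadcasts across columns. The per-row version needed `[:,None]` to broadcast down the rows instead.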
    

    Generic ndarray with generic axis

    Let's see how to extend this to an ndarray with any number of dimensions and count the unique elements along any axis. We can use np.diff with its axis param to get the consecutive differences, which makes it generic, like so -

    def nunique(a, axis):
        return (np.diff(np.sort(a,axis=axis),axis=axis)!=0).sum(axis=axis)+1
    

    Sample runs -

    In [77]: a
    Out[77]: 
    array([[1, 0, 2, 2, 0],
           [1, 0, 1, 2, 0],
           [0, 0, 0, 0, 2],
           [1, 2, 1, 0, 1],
           [2, 0, 1, 0, 0]])
    
    In [78]: nunique(a, axis=0)
    Out[78]: array([3, 2, 3, 2, 3])
    
    In [79]: nunique(a, axis=1)
    Out[79]: array([3, 3, 2, 3, 3])
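
The sample run above is 2D. Here is a small sketch of the same function on a 3D array, which shows that the result has the input's shape with the chosen axis removed. The array values are made up for illustration.

```python
import numpy as np

def nunique(a, axis):
    return (np.diff(np.sort(a, axis=axis), axis=axis) != 0).sum(axis=axis) + 1

a = np.array([[[1, 1, 2],
               [0, 0, 0]],
              [[1, 2, 3],
               [4, 4, 5]]])   # shape (2, 2, 3)

for axis in range(a.ndim):
    out = nunique(a, axis)
    print(axis, out.shape, out.tolist())
# 0 (2, 3) [[1, 2, 2], [2, 2, 2]]
# 1 (2, 3) [[2, 2, 2], [2, 2, 2]]
# 2 (2, 2) [[2, 1], [3, 2]]
```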
    

    If you are working with floating-point numbers and want uniqueness to be decided by some tolerance value rather than an exact match, you can use np.isclose. Two such options would be -

    (~np.isclose(np.diff(np.sort(a,axis=axis),axis=axis),0)).sum(axis)+1
    a.shape[axis]-np.isclose(np.diff(np.sort(a,axis=axis),axis=axis),0).sum(axis)
    

    For a custom tolerance value, pass it to np.isclose through its atol argument; rtol has no effect here because the differences are compared against 0.
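
As a minimal sketch of the tolerance-based variant, the first option can be wrapped in a function. The name `nunique_close` and its `atol` default are assumptions made for this example. Note that only neighbouring sorted values are compared, so a chain of values that are each within `atol` of the next is counted as one value even if its two ends are further apart.

```python
import numpy as np

def nunique_close(a, axis, atol=1e-8):
    # Gaps between sorted neighbours; a gap within atol means "same value".
    gaps = np.diff(np.sort(a, axis=axis), axis=axis)
    return (~np.isclose(gaps, 0, atol=atol)).sum(axis=axis) + 1

a = np.array([[1.0, 1.0 + 1e-10, 2.0],
              [0.1 + 0.2, 0.3, 0.3]])

print(nunique_close(a, axis=1))            # [2 1]
print(nunique_close(a, axis=1, atol=0.0))  # [3 2], i.e. exact matching
```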
