Find the set difference between two large arrays (matrices) in Python

天命终不由人 2020-12-06 16:54

I have two large 2-D arrays and I'd like to find their set difference, treating their rows as elements. In MATLAB, the code for this would be setdiff(A,B,'rows').

3 Answers
  • 2020-12-06 17:43

    This should work, but it is currently broken in NumPy 1.6.1 because mergesort is unavailable for the structured view being created. It works in the pre-release 1.7.0 version. It should also be the fastest approach, since the views don't copy any memory:

    >>> import numpy as np
    >>> a1 = np.array([[1,2,3],[4,5,6],[7,8,9]])
    >>> a2 = np.array([[4,5,6],[7,8,9],[1,1,1]])
    >>> a1_rows = a1.view([('', a1.dtype)] * a1.shape[1])
    >>> a2_rows = a2.view([('', a2.dtype)] * a2.shape[1])
    >>> np.setdiff1d(a1_rows, a2_rows).view(a1.dtype).reshape(-1, a1.shape[1])
    array([[1, 2, 3]])
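
The view trick above can be wrapped in a reusable helper. This is only a sketch under a few assumptions: the name setdiff_rows is hypothetical, both inputs are 2-D with the same number of columns, and b can be safely cast to the dtype of a. Like np.setdiff1d itself, it returns sorted, de-duplicated rows.

```python
import numpy as np

def setdiff_rows(a, b):
    """Rows of `a` that do not occur in `b` (sorted, duplicates removed)."""
    # The view only works on C-contiguous memory, so enforce that first.
    a = np.ascontiguousarray(a)
    # Assumption: casting b to a's dtype is acceptable for the data at hand.
    b = np.ascontiguousarray(b, dtype=a.dtype)
    n_cols = a.shape[1]
    # One structured element per row; NumPy auto-names the fields f0, f1, ...
    row_dtype = np.dtype([('', a.dtype)] * n_cols)
    a_rows = a.view(row_dtype).ravel()
    b_rows = b.view(row_dtype).ravel()
    diff = np.setdiff1d(a_rows, b_rows)
    # View the surviving records back as plain numbers, one row each.
    return diff.view(a.dtype).reshape(-1, n_cols)

a1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
a2 = np.array([[4, 5, 6], [7, 8, 9], [1, 1, 1]])
result = setdiff_rows(a1, a2)
```

The np.ascontiguousarray calls are there because slices and transposes are generally not contiguous, and the structured view would fail on them.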
    

    You can also do this in pure Python, but it might be slow:

    >>> import numpy as np
    >>> a1 = np.array([[1,2,3],[4,5,6],[7,8,9]])
    >>> a2 = np.array([[4,5,6],[7,8,9],[1,1,1]])
    >>> a1_rows = set(map(tuple, a1))
    >>> a2_rows = set(map(tuple, a2))
    >>> a1_rows.difference(a2_rows)
    set([(1, 2, 3)])
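
If the result is needed as an array again, the set-based approach can be extended along these lines. This is a sketch, not a tuned implementation: the helper name setdiff_rows_py is hypothetical, and it assumes the rows should keep the order of their first appearance in the first array.

```python
import numpy as np

def setdiff_rows_py(a, b):
    """Rows of `a` not present in `b`, in first-occurrence order."""
    # tolist() yields plain Python numbers, which hash and compare cheaply.
    b_rows = set(map(tuple, b.tolist()))
    seen = set()
    kept = []
    for row in map(tuple, a.tolist()):
        if row not in b_rows and row not in seen:
            seen.add(row)
            kept.append(row)
    # reshape keeps the column count even when no row survives.
    return np.array(kept, dtype=a.dtype).reshape(-1, a.shape[1])

a1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
a2 = np.array([[4, 5, 6], [7, 8, 9], [1, 1, 1]])
result = setdiff_rows_py(a1, a2)
```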
    
  • 2020-12-06 17:43

    Here is an alternative pure NumPy solution that works in 1.6.1. It creates an intermediate array of shape (len(A), len(B), number of columns), which may or may not be a problem for you. It also does not depend on whether the arrays are sorted (as setdiff probably does).

    from numpy import *
    # Create some sample arrays
    A = random.randint(0, 5, (10, 3))
    B = random.randint(0, 5, (10, 3))
    

    As an example, this is what I got - note that there is one common element:

    >>> A
    array([[1, 0, 3],
           [0, 4, 2],
           [0, 3, 4],
           [4, 4, 2],
           [2, 0, 2],
           [4, 0, 0],
           [3, 2, 2],
           [4, 2, 3],
           [0, 2, 1],
           [2, 0, 2]])
    >>> B
    array([[4, 1, 3],
           [4, 3, 0],
           [0, 3, 3],
           [3, 0, 3],
           [3, 4, 0],
           [3, 2, 3],
           [3, 1, 2],
           [4, 1, 2],
           [0, 4, 2],
           [0, 0, 3]])
    

    We look for pairs of rows whose L1 distance is zero. Broadcasting A against B gives a matrix of pairwise distances, and the positions where it is zero mark the rows common to both arrays:

    idx = where(abs((A[:,newaxis,:] - B)).sum(axis=2)==0)
    

    As a check:

    >>> A[idx[0]]
    array([[0, 4, 2]])
    >>> B[idx[1]]
    array([[0, 4, 2]])
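
The check above locates the common rows, whereas the question asks for the set difference. One way to get there with the same broadcasting idea is sketched below. The helper name setdiff_rows_broadcast is hypothetical, and direct equality is used in place of the L1 distance, which should be equivalent for integer data. The arrays are the sample A and B shown earlier.

```python
import numpy as np

def setdiff_rows_broadcast(A, B):
    """Rows of A that match no row of B (order and duplicates kept)."""
    # matches[i, j] is True when row i of A equals row j of B.
    matches = (A[:, np.newaxis, :] == B).all(axis=2)
    # A row of A belongs to the difference if it matches nothing in B.
    in_B = matches.any(axis=1)
    return A[~in_B]

A = np.array([[1, 0, 3], [0, 4, 2], [0, 3, 4], [4, 4, 2], [2, 0, 2],
              [4, 0, 0], [3, 2, 2], [4, 2, 3], [0, 2, 1], [2, 0, 2]])
B = np.array([[4, 1, 3], [4, 3, 0], [0, 3, 3], [3, 0, 3], [3, 4, 0],
              [3, 2, 3], [3, 1, 2], [4, 1, 2], [0, 4, 2], [0, 0, 3]])

A_minus_B = setdiff_rows_broadcast(A, B)
# Optional: drop repeated rows, as a true set difference would.
A_minus_B_unique = np.unique(A_minus_B, axis=0)
```

The intermediate boolean array has len(A) × len(B) × columns entries, so for very large inputs this will likely need to be processed in chunks.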
    
  • 2020-12-06 17:44

    I'm not sure exactly what you are after, but this gives you a boolean array marking where two arrays are not equal, and it runs at NumPy speed:

    
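    import numpy as np
    a = np.random.randn(5, 5)
    b = np.random.randn(5, 5)
    # Force two positions to hold identical values in both arrays
    a[0,0] = 10.0
    b[0,0] = 10.0
    a[1,1] = 5.0
    b[1,1] = 5.0
    # True wherever the two arrays differ
    c = ~(a-b==0)
    print(c)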

    [[False  True  True  True  True]
     [ True False  True  True  True]
     [ True  True  True  True  True]
     [ True  True  True  True  True]
     [ True  True  True  True  True]]
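
Exact comparison of floats can flag entries that differ only by rounding noise. A possible variation, sketched here with made-up data and a hypothetical seed, uses np.isclose with its default tolerances and then reduces the mask per row, which is closer to the row-wise question being asked.

```python
import numpy as np

rng = np.random.default_rng(0)          # hypothetical seed, for repeatability
a = rng.standard_normal((5, 5))
b = a.copy()
b[2:] = rng.standard_normal((3, 5))     # rows 2-4 now differ from a
b[0, 0] += 1e-12                        # rounding-sized noise in row 0

# For finite values this is the same mask as ~(a - b == 0).
exact_diff = a != b
# Tolerant comparison; default tolerances are rtol=1e-05, atol=1e-08.
close_diff = ~np.isclose(a, b)
# True for every row that differs in at least one column.
rows_differ = close_diff.any(axis=1)
```

Here exact_diff marks the perturbed entry as different, while close_diff ignores it. Which behavior is right depends on where the data comes from.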
