Check common elements of two 2D numpy arrays, either row or column wise

问题

Given two numpy arrays of nx3 and mx3, what is an efficient way to determine the row indices (counter) wherein the rows are common in the two arrays. For instance I have the following solution, which is significantly slow for not even much larger arrays

def arrangment(arr1,arr2):
    hits = []
    for i in range(arr2.shape[0]):
        current_row = np.repeat(arr2[i,:][None,:],arr1.shape[0],axis=0)
        x = current_row - arr1
        for j in range(arr1.shape[0]):
            if np.isclose(x[j,0],0.0) and np.isclose(x[j,1],0.0) and np.isclose(x[j,2],0.0):
                hits.append(j)

    return hits

It checks if rows of arr2 exist in arr1 and returns the row indices of arr1 where the rows match. I need this arrangement to be always sequentially ascending in terms of rows of arr2. For instance given

arr1 = np.array([[-1., -1., -1.],
       [ 1., -1., -1.],
       [ 1.,  1., -1.],
       [-1.,  1., -1.],
       [-1., -1.,  1.],
       [ 1., -1.,  1.],
       [ 1.,  1.,  1.],
       [-1.,  1.,  1.]])
arr2 = np.array([[-1.,  1., -1.],
       [ 1.,  1., -1.],
       [ 1.,  1.,  1.],
       [-1.,  1.,  1.]])

The function should return:

[3, 2, 6, 7]

回答1:

quick and dirty answer

(arr1[:, None] == arr2).all(-1).argmax(0)

array([3, 2, 6, 7])

Better answer
Takes care of chance a row in arr2 doesn't match anything in arr1

t = (arr1[:, None] == arr2).all(-1)
np.where(t.any(0), t.argmax(0), np.nan)

array([ 3.,  2.,  6.,  7.])

As pointed out by @Divakar np.isclose accounts for rounding error in comparing floats

t = np.isclose(arr1[:, None], arr2).all(-1)
np.where(t.any(0), t.argmax(0), np.nan)

回答2:

I had a similar problem in the past and I came up with a fairly optimised solution for it.

First you need a generalisation of numpy.unique for multidimensional arrays, which for the sake of completeness I would copy it here

def unique2d(arr,consider_sort=False,return_index=False,return_inverse=False): 
    """Get unique values along an axis for 2D arrays.

        input:
            arr:
                2D array
            consider_sort:
                Does permutation of the values within the axis matter? 
                Two rows can contain the same values but with 
                different arrangements. If consider_sort 
                is True then those rows would be considered equal
            return_index:
                Similar to numpy unique
            return_inverse:
                Similar to numpy unique
        returns:
            2D array of unique rows
            If return_index is True also returns indices
            If return_inverse is True also returns the inverse array 
            """

    if consider_sort is True:
        a = np.sort(arr,axis=1)
    else:
        a = arr
    b = np.ascontiguousarray(a).view(np.dtype((np.void, 
            a.dtype.itemsize * a.shape[1])))

    if return_inverse is False:
        _, idx = np.unique(b, return_index=True)
    else:
        _, idx, inv = np.unique(b, return_index=True, return_inverse=True)

    if return_index == False and return_inverse == False:
        return arr[idx]
    elif return_index == True and return_inverse == False:
        return arr[idx], idx
    elif return_index == False and return_inverse == True:
        return arr[idx], inv
    else:
        return arr[idx], idx, inv

Now all you need is to concatenate (np.vstack) your arrays and find the unique rows. The reverse mapping together with np.searchsorted will give you the indices you need. So lets write another function similar to numpy.in2d but for multidimensional (2D) arrays

def in2d_unsorted(arr1, arr2, axis=1, consider_sort=False):
    """Find the elements in arr1 which are also in 
       arr2 and sort them as the appear in arr2"""

    assert arr1.dtype == arr2.dtype

    if axis == 0:
        arr1 = np.copy(arr1.T,order='C')
        arr2 = np.copy(arr2.T,order='C')

    if consider_sort is True:
        sorter_arr1 = np.argsort(arr1)
        arr1 = arr1[np.arange(arr1.shape[0])[:,None],sorter_arr1]
        sorter_arr2 = np.argsort(arr2)
        arr2 = arr2[np.arange(arr2.shape[0])[:,None],sorter_arr2]


    arr = np.vstack((arr1,arr2))
    _, inv = unique2d(arr, return_inverse=True)

    size1 = arr1.shape[0]
    size2 = arr2.shape[0]

    arr3 = inv[:size1]
    arr4 = inv[-size2:]

    # Sort the indices as they appear in arr2
    sorter = np.argsort(arr3)
    idx = sorter[arr3.searchsorted(arr4, sorter=sorter)]

    return idx

Now all you need to do is call in2d_unsorted with your input parameters

>>> in2d_unsorted(arr1,arr2)
array([ 3,  2,  6,  7])

While may not be fully optimised this approach is much faster. Let's benchmark it against @piRSquareds solutions

def indices_piR(arr1,arr2):
    t = np.isclose(arr1[:, None], arr2).all(-1)
    return np.where(t.any(0), t.argmax(0), np.nan)

with the following arrays

n=150
arr1 = np.random.permutation(n).reshape(n//3, 3)
idx = np.random.permutation(n//3)
arr2 = arr1[idx]

In [13]: np.allclose(in2d_unsorted(arr1,arr2),indices_piR(arr1,arr2))
True

In [14]: %timeit indices_piR(arr1,arr2)
10000 loops, best of 3: 181 µs per loop
In [15]: %timeit in2d_unsorted(arr1,arr2)
10000 loops, best of 3: 85.7 µs per loop

Now, for n=1500

In [24]: %timeit indices_piR(arr1,arr2)
100 loops, best of 3: 10.3 ms per loop
In [25]: %timeit in2d_unsorted(arr1,arr2)
1000 loops, best of 3: 403 µs per loop

and for n=15000

In [28]: %timeit indices_piR(A,B)
1 loop, best of 3: 1.02 s per loop
In [29]: %timeit in2d_unsorted(arr1,arr2)
100 loops, best of 3: 4.65 ms per loop

So for largeish arrays this is over 200X faster compared to @piRSquared's vectorised solution.

来源：https://stackoverflow.com/questions/41234161/check-common-elements-of-two-2d-numpy-arrays-either-row-or-column-wise

标签

python

arrays

performance

numpy