Vectorized spatial distance in python using numpy

前端 未结 1 1573
日久生厌
日久生厌 2020-11-30 14:56

i have numpy array in python which contains lots (10k+) of 3D vertex points (vectors with coordinates [x,y,z]). I need to calculate distance between all possible pairs of th

相关标签:
1条回答
  • 2020-11-30 15:31

    There's eucl_dist package (disclaimer: I am its author) that basically contains two methods to solve the problem of computing squared euclidean distances that are more efficient than SciPy's cdist, especially for large arrays ( with decent to large number of columns).

    We will use some of the codes from its source code to adapt to our problem here to give us two approaches.

    Approach #1

    Following the wiki contents, we could leverage matrix-multiplication and some NumPy specific implementations for our first approach, like so -

    def pdist_squareformed_numpy(a):
        a_sumrows = np.einsum('ij,ij->i',a,a)
        dist = a_sumrows[:,None] + a_sumrows -2*np.dot(a,a.T)
        np.fill_diagonal(dist,0)
        return dist
    

    Approach #2

    One more method would be to create "extended" versions of input arrays, again discussed in detail in that github source code link to have our second approach, which is better for lesser columns, as is the case here, like so -

    def ext_arrs(A,B, precision="float64"):
        nA,dim = A.shape
        A_ext = np.ones((nA,dim*3),dtype=precision)
        A_ext[:,dim:2*dim] = A
        A_ext[:,2*dim:] = A**2
    
        nB = B.shape[0]
        B_ext = np.ones((dim*3,nB),dtype=precision)
        B_ext[:dim] = (B**2).T
        B_ext[dim:2*dim] = -2.0*B.T
        return A_ext, B_ext
    
    def pdist_squareformed_numpy_v2(a):
        A_ext, B_ext = ext_arrs(a,a)
        dist = A_ext.dot(B_ext)
        np.fill_diagonal(dist,0)
        return dist
    

    Note that these give us squared eucludean distances. So, for the actual distances, we want to use np.sqrt() if that's the final output needed.

    Sample runs -

    In [380]: np.random.seed(0)
         ...: a = np.random.rand(5,3)
    
    In [381]: from scipy.spatial.distance import cdist
    
    In [382]: cdist(a,a)
    Out[382]: 
    array([[0.  , 0.29, 0.42, 0.2 , 0.57],
           [0.29, 0.  , 0.58, 0.42, 0.76],
           [0.42, 0.58, 0.  , 0.45, 0.9 ],
           [0.2 , 0.42, 0.45, 0.  , 0.51],
           [0.57, 0.76, 0.9 , 0.51, 0.  ]])
    
    In [383]: np.sqrt(pdist_squareformed_numpy(a))
    Out[383]: 
    array([[0.  , 0.29, 0.42, 0.2 , 0.57],
           [0.29, 0.  , 0.58, 0.42, 0.76],
           [0.42, 0.58, 0.  , 0.45, 0.9 ],
           [0.2 , 0.42, 0.45, 0.  , 0.51],
           [0.57, 0.76, 0.9 , 0.51, 0.  ]])
    
    In [384]: np.sqrt(pdist_squareformed_numpy_v2(a))
    Out[384]: 
    array([[0.  , 0.29, 0.42, 0.2 , 0.57],
           [0.29, 0.  , 0.58, 0.42, 0.76],
           [0.42, 0.58, 0.  , 0.45, 0.9 ],
           [0.2 , 0.42, 0.45, 0.  , 0.51],
           [0.57, 0.76, 0.9 , 0.51, 0.  ]])
    

    Timings on 10k points -

    In [385]: a = np.random.rand(10000,3)
    
    In [386]: %timeit cdist(a,a)
    1 loop, best of 3: 309 ms per loop
    
    # Approach #1
    In [388]: %timeit pdist_squareformed_numpy(a) # squared eucl distances
    1 loop, best of 3: 668 ms per loop
    
    In [389]: %timeit np.sqrt(pdist_squareformed_numpy(a)) # actual eucl distances
    1 loop, best of 3: 812 ms per loop
    
    # Approach #2
    In [390]: %timeit pdist_squareformed_numpy_v2(a) # squared eucl distances
    1 loop, best of 3: 237 ms per loop
    
    In [391]: %timeit np.sqrt(pdist_squareformed_numpy_v2(a)) # actual eucl distances
    1 loop, best of 3: 395 ms per loop
    

    The second approach seems close to cdist one on performance!

    0 讨论(0)
提交回复
热议问题