Fastest pairwise distance metric in Python

天涯浪人 2020-11-30 06:47

I have a 1D array of numbers and want to calculate all pairwise Euclidean distances. I have a method (thanks to SO) of doing this with broadcasting, but it's inefficient.

3 Answers
  • 2020-11-30 07:19

    Neither of the other answers quite answered the question: one was in Cython, the other was slower. Both provided very useful hints, though, and following up on them suggests that scipy.spatial.distance.pdist is the way to go.

    Here's some code:

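    import random
    
    import numpy as np
    import scipy.spatial.distance
    import sklearn.metrics.pairwise
    
    r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
    c = r[:, None]  # column vector, shape (1000, 1): pdist and sklearn need 2-D input
    
    def option1(r):
        # Broadcasting: builds the full 1000 x 1000 matrix
        return np.abs(r - r[:, None])
    
    def option2(c):
        # Condensed output: one entry per pair (i, j) with i < j
        return scipy.spatial.distance.pdist(c, 'cityblock')
    
    def option3(c):
        # Builds the full 1000 x 1000 matrix
        return sklearn.metrics.pairwise.manhattan_distances(c)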
    

    Timing with IPython:

    In [36]: timeit option1(r)
    100 loops, best of 3: 5.31 ms per loop
    
    In [37]: timeit option2(c)
    1000 loops, best of 3: 1.84 ms per loop
    
    In [38]: timeit option3(c)
    100 loops, best of 3: 11.5 ms per loop
    

    I didn't try the Cython implementation (I can't use it for this project), but comparing my results to the other answer that did, it looks like scipy.spatial.distance.pdist is roughly a third slower than the Cython implementation (normalizing for the different machines by using the np.abs solution as a common baseline).
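
    The question asks for Euclidean distances while the benchmark uses 'cityblock'; for 1-D data the two coincide, since both reduce to |a - b|. The sketch below is not part of the original answer. It checks that equivalence and shows how the condensed pdist output relates to the full broadcast matrix via scipy.spatial.distance.squareform. The seed and variable names are assumptions chosen for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
r = rng.integers(1, 1000, size=1000).astype(float)
c = r[:, None]  # pdist expects a 2-D array of shape (n_points, n_dims)

# Condensed result: n*(n-1)/2 entries, one per pair (i, j) with i < j.
condensed = pdist(c, "cityblock")

# In one dimension, Euclidean and Manhattan distances are the same numbers.
same_metric = np.allclose(condensed, pdist(c, "euclidean"))

# Expand to the symmetric n x n matrix and compare with broadcasting.
square = squareform(condensed)
same_as_broadcast = np.allclose(square, np.abs(r - r[:, None]))
```

    The condensed form stores each pair once and skips the zero diagonal, which accounts for part of the memory saving over the broadcast version.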

  • 2020-11-30 07:23

    Here is a Cython implementation that gives more than a 3X speed improvement for this example on my computer. This timing should be reviewed for bigger arrays, though, because the BLAS routines can probably scale much better than this rather naive code.

    I know you asked for something inside scipy/numpy/scikit-learn, but maybe this will open new possibilities for you:

    File my_cython.pyx:

    import numpy as np
    cimport numpy as np
    import cython
    from libc.math cimport fabs  # floating-point absolute value; C's abs() takes an int
    
    @cython.wraparound(False)
    @cython.boundscheck(False)
    def pairwise_distance(np.ndarray[np.double_t, ndim=1] r):
        cdef Py_ssize_t i, j, c, n
        cdef np.ndarray[np.double_t, ndim=1] ans
        n = r.shape[0]
        # One entry per pair (i, j) with i <= j, diagonal included
        ans = np.empty(n * (n + 1) // 2, dtype=r.dtype)
        c = 0
        for i in range(n):
            for j in range(i, n):
                ans[c] = fabs(r[i] - r[j])
                c += 1
        return ans
    

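    The result is a 1-D array containing each pair (i, j) with i <= j exactly once. This includes the zero distance of every element to itself, so for n inputs it has n(n+1)/2 entries.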
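
    For readers who cannot compile the extension, the output layout can be reproduced in plain NumPy. This is a minimal sketch, not the author's code: pairwise_distance_reference is a hypothetical helper that only mirrors the ordering of the nested loops and makes no claim about speed.

```python
import numpy as np

def pairwise_distance_reference(r):
    """Hypothetical pure-NumPy stand-in that mirrors the Cython loop order."""
    # triu_indices(n) with the default k=0 yields pairs (i, j) with i <= j
    # in row-major order, the same order the nested loops visit them.
    i, j = np.triu_indices(r.shape[0])
    return np.abs(r[i] - r[j])

r = np.array([3.0, 10.0, 4.0])
out = pairwise_distance_reference(r)
# Pairs visited: (0,0), (0,1), (0,2), (1,1), (1,2), (2,2)
print(out)  # [0. 7. 1. 0. 6. 0.]
```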

    To import into Python:

    import numpy as np
    import random
    
    import pyximport; pyximport.install()
    from my_cython import pairwise_distance
    
    r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)], dtype=float)
    
    def solOP(r):
        return np.abs(r - r[:, None])
    

    Timing with IPython:

    In [2]: timeit solOP(r)
    100 loops, best of 3: 7.38 ms per loop
    
    In [3]: timeit pairwise_distance(r)
    1000 loops, best of 3: 1.77 ms per loop
    
  • 2020-11-30 07:43

    Using half the memory, but 6 times slower than np.abs(r - r[:, None]):

    triu = np.triu_indices(r.shape[0], 1)  # index pairs (i, j) with i < j
    dists2 = np.abs(r[triu[1]] - r[triu[0]])
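
    To make the memory claim concrete, the sketch below (an illustration with made-up input values, not part of the original answer) checks that the triu-based result equals the strict upper triangle of the broadcast matrix, and counts the entries each approach stores.

```python
import numpy as np

r = np.array([3.0, 10.0, 4.0, 8.0])  # small made-up input for illustration
n = r.shape[0]

rows, cols = np.triu_indices(n, 1)  # strictly upper triangle: i < j
dists2 = np.abs(r[cols] - r[rows])

# The broadcast version builds every (i, j) pair, including duplicates.
full = np.abs(r - r[:, None])
matches = np.array_equal(dists2, full[rows, cols])

full_entries = full.size     # n * n
unique_entries = dists2.size  # n * (n - 1) / 2
print(dists2)  # [7. 1. 5. 6. 2. 4.]
```

    The result itself holds slightly less than half as many entries as the full matrix. The index arrays returned by np.triu_indices take additional memory while the computation runs.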
    