Find the nearest point in distance for all the points in the dataset - Python

前端 未结 3 1716
栀梦
栀梦 2020-12-15 13:04

I have a dataset as follows,

Id     Latitude      longitude
1      25.42         55.47
2      25.39         55.47
3      24.48         54.38
4      24.51             


        
相关标签:
3条回答
  • 2020-12-15 13:30

    You can do this very efficiently by calling a library that implements smart algorithms for this, one example would be sklearn which has a NearestNeighbors method that does exactly this.

    Example of the code modified to do this:

    from sklearn.neighbors import NearestNeighbors
    import numpy as np
    
    def distance(p1, p2):
        """
        Calculate the great circle distance between two points
        on the earth (specified in decimal degrees)
        """
        lon1, lat1 = p1
        lon2, lat2 = p2
        # convert decimal degrees to radians
        lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
        # haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
        c = 2 * np.arcsin(np.sqrt(a))
        km = 6367 * c
        return km
    
    points = [[25.42, 55.47],
              [25.39, 55.47],
              [24.48, 54.38],
              [24.51, 54.54]]
    
    nbrs = NearestNeighbors(n_neighbors=2, metric=distance).fit(points)
    
    distances, indices = nbrs.kneighbors(points)
    
    result = distances[:, 1]
    

    which gives

    >>> result
    array([  1.889697  ,   1.889697  ,  17.88530556,  17.88530556])
    
    0 讨论(0)
  • 2020-12-15 13:38

    The brute force method of finding the nearest of N points to a given point is O(N) -- you'd have to check each point. In contrast, if the N points are stored in a KD-tree, then finding the nearest point is on average O(log(N)). There is also the additional one-time cost of building the KD-tree, which requires O(N) time.

    If you need to repeat this process N times, then the brute force method is O(N**2) and the kd-tree method is O(N*log(N)). Thus, for large enough N, the KD-tree will beat the brute force method.

    See here for more on nearest neighbor algorithms (including KD-tree).


    Below (in the function using_kdtree) is a way to compute the great circle arclengths of nearest neighbors using scipy.spatial.kdtree.

    scipy.spatial.kdtree uses the Euclidean distance between points, but there is a formula for converting Euclidean chord distances between points on a sphere to great circle arclength (given the radius of the sphere). So the idea is to convert the latitude/longitude data into cartesian coordinates, use a KDTree to find the nearest neighbors, and then apply the great circle distance formula to obtain the desired result.


    Here are some benchmarks. Using N = 100, using_kdtree is 39x faster than the orig (brute force) method.

    In [180]: %timeit using_kdtree(data)
    100 loops, best of 3: 18.6 ms per loop
    
    In [181]: %timeit using_sklearn(data)
    1 loop, best of 3: 214 ms per loop
    
    In [179]: %timeit orig(data)
    1 loop, best of 3: 728 ms per loop
    

    For N = 10000:

    In [5]: %timeit using_kdtree(data)
    1 loop, best of 3: 2.78 s per loop
    
    In [6]: %timeit using_sklearn(data)
    1 loop, best of 3: 1min 15s per loop
    
    In [7]: %timeit orig(data)
    # untested; too slow
    

    Since using_kdtree is O(N log(N)) and orig is O(N**2), the factor by which using_kdtree is faster than orig will grow as N, the length of data, grows.


    import numpy as np
    import scipy.spatial as spatial
    import pandas as pd
    import sklearn.neighbors as neighbors
    from math import radians, cos, sin, asin, sqrt
    
    R = 6367
    
    def using_kdtree(data):
        "Based on https://stackoverflow.com/q/43020919/190597"
        def dist_to_arclength(chord_length):
            """
            https://en.wikipedia.org/wiki/Great-circle_distance
            Convert Euclidean chord length to great circle arc length
            """
            central_angle = 2*np.arcsin(chord_length/(2.0*R)) 
            arclength = R*central_angle
            return arclength
    
        phi = np.deg2rad(data['Latitude'])
        theta = np.deg2rad(data['Longitude'])
        data['x'] = R * np.cos(phi) * np.cos(theta)
        data['y'] = R * np.cos(phi) * np.sin(theta)
        data['z'] = R * np.sin(phi)
        tree = spatial.KDTree(data[['x', 'y','z']])
        distance, index = tree.query(data[['x', 'y','z']], k=2)
        return dist_to_arclength(distance[:, 1])
    
    def orig(data):
        def distance(lon1, lat1, lon2, lat2):
            """
            Calculate the great circle distance between two points 
            on the earth (specified in decimal degrees)
            """
            # convert decimal degrees to radians 
            lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
            # haversine formula 
            dlon = lon2 - lon1 
            dlat = lat2 - lat1 
            a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
            c = 2 * asin(sqrt(a)) 
            km = R * c
            return km
    
        shortest_distance = []
        for i in range(len(data)):
            distance1 = []
            for j in range(len(data)):
                if i == j: continue
                distance1.append(distance(data['Longitude'][i], data['Latitude'][i], 
                                          data['Longitude'][j], data['Latitude'][j]))
            shortest_distance.append(min(distance1))
        return shortest_distance
    
    
    def using_sklearn(data):
        """
        Based on https://stackoverflow.com/a/45127250/190597 (Jonas Adler)
        """
        def distance(p1, p2):
            """
            Calculate the great circle distance between two points
            on the earth (specified in decimal degrees)
            """
            lon1, lat1 = p1
            lon2, lat2 = p2
            # convert decimal degrees to radians
            lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
            # haversine formula
            dlon = lon2 - lon1
            dlat = lat2 - lat1
            a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
            c = 2 * np.arcsin(np.sqrt(a))
            km = R * c
            return km
        points = data[['Longitude', 'Latitude']]
        nbrs = neighbors.NearestNeighbors(n_neighbors=2, metric=distance).fit(points)
        distances, indices = nbrs.kneighbors(points)
        result = distances[:, 1]
        return result
    
    np.random.seed(2017)
    N = 1000
    data = pd.DataFrame({'Latitude':np.random.uniform(-90,90,size=N), 
                         'Longitude':np.random.uniform(0,360,size=N)})
    
    expected = orig(data)
    for func in [using_kdtree, using_sklearn]:
        result = func(data)
        assert np.allclose(expected, result)
    
    0 讨论(0)
  • 2020-12-15 13:48

    You can use a dictionary to hash some calculations. Your code calculates the distance A to B many times (A and B being 2 arbitrary points in your dataset).

    Either implement your own cache:

    from math import radians, cos, sin, asin, sqrt
    
    dist_cache = {}
    def distance(lon1, lat1, lon2, lat2):
        """
        Calculate the great circle distance between two points 
        on the earth (specified in decimal degrees)
        """
    
        try:
            return dist_cache[(lon1, lat1, lon2, lat2)]
        except KeyError:
            # convert decimal degrees to radians 
            lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
            # haversine formula 
            dlon = lon2 - lon1 
            dlat = lat2 - lat1 
            a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
            c = 2 * asin(sqrt(a)) 
            km = 6367 * c
            dist_cache[(lon1, lat1, lon2, lat2)] = km
            return km
    

    Or use lru_cache:

    from math import radians, cos, sin, asin, sqrt
    from functools import lru_cache
    
    @lru_cache
    def distance(lon1, lat1, lon2, lat2):
        """
        Calculate the great circle distance between two points 
        on the earth (specified in decimal degrees)
        """
        # convert decimal degrees to radians 
        lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
        # haversine formula 
        dlon = lon2 - lon1 
        dlat = lat2 - lat1 
        a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
        c = 2 * asin(sqrt(a)) 
        km = 6367 * c
        return km
    
    0 讨论(0)
提交回复
热议问题