Pandas Dataframe: join items in range based on their geo coordinates (longitude and latitude)

南方客 2020-12-16 03:43

I have a DataFrame that contains places with their latitude and longitude, for example cities.

df = pd.DataFrame([{'city':"Berlin", 'lat':52.524

2 Answers
  •  被撕碎了的回忆
    2020-12-16 03:51

    UPDATE: I would suggest building a distance DataFrame first:

    from itertools import combinations

    import pandas as pd
    from scipy.spatial.distance import pdist

    # see definition of "haversine_np()" below
    # one row per unordered pair of cities, indexed by (city, city)
    x = pd.DataFrame({'dist': pdist(df[['lat','lng']], haversine_np)},
                     index=pd.MultiIndex.from_tuples(tuple(combinations(df['city'], 2))))
    

    This efficiently produces a pairwise distance DataFrame in km (without duplicate pairs):

    In [106]: x
    Out[106]:
                           dist
    Berlin  Potsdam   27.198616
            Hamburg  255.063541
    Potsdam Hamburg  242.311890
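
    The long-form table above lends itself to the "items in range" part of the question: filter the pairs by a threshold, then join the names of each city's neighbours. The following is a minimal sketch, not the answer author's code. The coordinates are approximate values assumed for illustration (the question's own data is truncated), and max_km, pairs, in_range, and neighbours are hypothetical names.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist


def haversine_np(p1, p2):
    """Great-circle distance in km between two (lat, lng) points in degrees."""
    lat1, lon1, lat2, lon2 = np.radians([p1[0], p1[1], p2[0], p2[1]])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 6367 * 2 * np.arcsin(np.sqrt(a))


# Assumed sample data: approximate coordinates, not the question's exact values.
df = pd.DataFrame({
    "city": ["Berlin", "Potsdam", "Hamburg"],
    "lat": [52.52, 52.40, 53.55],
    "lng": [13.405, 13.06, 9.99],
})

# Long-form pairwise distances, one row per unordered pair of cities.
pairs = pd.DataFrame(
    {"dist": pdist(df[["lat", "lng"]].to_numpy(), haversine_np)},
    index=pd.MultiIndex.from_tuples(list(combinations(df["city"], 2)),
                                    names=["city_a", "city_b"]),
)

max_km = 30  # hypothetical range threshold
in_range = pairs[pairs["dist"] <= max_km].reset_index()

# Mirror each pair so that both cities list each other, then join the names.
swapped = in_range.rename(columns={"city_a": "city_b", "city_b": "city_a"})
mirrored = pd.concat([in_range[["city_a", "city_b"]],
                      swapped[["city_a", "city_b"]]])
neighbours = mirrored.groupby("city_a")["city_b"].agg(", ".join)
```

    Cities with no neighbour inside the threshold simply do not appear in neighbours; reindexing it by df["city"] would make them explicit.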
    

    Old answer:

    Here is a slightly optimized version that uses the scipy.spatial.distance.pdist method:

    import numpy as np
    import pandas as pd
    from scipy.spatial.distance import squareform, pdist

    # slightly modified version of http://stackoverflow.com/a/29546836/2901002
    def haversine_np(p1, p2):
        """
        Calculate the great-circle distance in km between two points
        on the earth (specified in decimal degrees).

        p1 and p2 are (lat, lng) pairs.

        """
        lat1, lon1, lat2, lon2 = np.radians([p1[0], p1[1],
                                             p2[0], p2[1]])
        dlon = lon2 - lon1
        dlat = lat2 - lat1

        a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

        c = 2 * np.arcsin(np.sqrt(a))
        # Earth radius in km as used in the linked answer (the mean radius is about 6371 km)
        km = 6367 * c
        return km

    # pdist returns the condensed distances; squareform expands them to a full symmetric matrix
    x = pd.DataFrame(squareform(pdist(df[['lat','lng']], haversine_np)),
                     columns=df.city.unique(),
                     index=df.city.unique())
    

    This gives us the full distance matrix in km:

    In [78]: x
    Out[78]:
                 Berlin     Potsdam     Hamburg
    Berlin     0.000000   27.198616  255.063541
    Potsdam   27.198616    0.000000  242.311890
    Hamburg  255.063541  242.311890    0.000000
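
    Because pdist calls a Python function once per pair, the same matrix can also be computed in a single NumPy expression via broadcasting. This is a sketch of an alternative technique, not part of the original answer; haversine_matrix is a hypothetical helper, and the coordinates are approximate values assumed for illustration.

```python
import numpy as np
import pandas as pd


def haversine_matrix(lat, lng, radius_km=6367):
    """All-pairs great-circle distances (km) for coordinate arrays in degrees."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lng = np.radians(np.asarray(lng, dtype=float))
    # Broadcasting a column against a row yields an (n, n) grid of differences.
    dlat = lat[:, None] - lat[None, :]
    dlng = lng[:, None] - lng[None, :]
    a = (np.sin(dlat / 2.0) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlng / 2.0) ** 2)
    # Clipping guards against rounding pushing a slightly outside [0, 1].
    return radius_km * 2 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# Assumed sample data: approximate coordinates, not the question's exact values.
df = pd.DataFrame({
    "city": ["Berlin", "Potsdam", "Hamburg"],
    "lat": [52.52, 52.40, 53.55],
    "lng": [13.405, 13.06, 9.99],
})

dist = pd.DataFrame(haversine_matrix(df["lat"], df["lng"]),
                    index=df["city"].tolist(),
                    columns=df["city"].tolist())
```

    The full matrix needs memory proportional to the square of the number of rows, so this form suits moderately sized tables.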
    

    Let's count, for each city, the number of other cities that are more than 30 km away:

    In [81]: x.groupby(level=0, as_index=False) \
        ...:  .apply(lambda c: c[c>30].notnull().sum(1)) \
        ...:  .reset_index(level=0, drop=True)
    Out[81]:
    Berlin     1
    Hamburg    2
    Potsdam    1
    dtype: int64
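
    The same counts can be obtained without groupby by summing a boolean mask along each row. The sketch below rebuilds the matrix from the values printed above so that it runs on its own; far_counts and near_counts are hypothetical names.

```python
import pandas as pd

cities = ["Berlin", "Potsdam", "Hamburg"]
# Distance matrix copied from the output shown above (km).
x = pd.DataFrame(
    [[0.000000, 27.198616, 255.063541],
     [27.198616, 0.000000, 242.311890],
     [255.063541, 242.311890, 0.000000]],
    index=cities, columns=cities,
)

# True where the distance exceeds 30 km; summing each row counts those cities.
far_counts = (x > 30).sum(axis=1)

# The mirror-image count of cities within 30 km. The zero diagonal is excluded
# so that a city is not counted as its own neighbour (this also skips any
# distinct rows that share identical coordinates).
near_counts = ((x <= 30) & (x > 0)).sum(axis=1)
```

    Here far_counts holds 1 for Berlin, 1 for Potsdam, and 2 for Hamburg, matching the output above.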
    
