Pandas Dataframe: join items in range based on their geo coordinates (longitude and latitude)

南方客 2020-12-16 03:43

I have a DataFrame that contains places with their latitude and longitude, for example cities.

df = pd.DataFrame([{'city':"Berlin", 'lat':52.524


        
2 Answers
  • 2020-12-16 03:45

    You can use:

    import pandas as pd
    from math import radians, cos, sin, asin, sqrt
    
    def haversine(lon1, lat1, lon2, lat2):
    
        lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    
        # haversine formula 
        dlon = lon2 - lon1 
        dlat = lat2 - lat1 
        a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
        c = 2 * asin(sqrt(a)) 
        r = 6371 # Radius of earth in kilometers. Use 3956 for miles
        return c * r
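
    As a quick sanity check, here is a minimal sketch that calls the same formula on the Berlin and Potsdam coordinates used in this thread. The variable name `berlin_potsdam_km` is only illustrative. Note that the function takes longitude first, then latitude.

```python
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two points in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * asin(sqrt(a)) * 6371  # 6371 km: Earth radius used above

# Berlin -> Potsdam, passed as (longitude, latitude) pairs.
berlin_potsdam_km = haversine(13.41053, 52.52437, 13.06566, 52.39886)
print(round(berlin_potsdam_km, 1))  # about 27.2
```

    Swapping the two points gives the same value, since the formula is symmetric.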
    

    First, build a cross join with merge on a constant helper column, then remove the rows where city_x and city_y are the same city by boolean indexing:

    df['tmp'] = 1
    df = pd.merge(df,df,on='tmp')
    df = df[df.city_x != df.city_y]
    print (df)
        city_x     lat_x     lng_x  tmp   city_y     lat_y     lng_y
    1   Berlin  52.52437  13.41053    1  Potsdam  52.39886  13.06566
    2   Berlin  52.52437  13.41053    1  Hamburg  53.57532  10.01534
    3  Potsdam  52.39886  13.06566    1   Berlin  52.52437  13.41053
    5  Potsdam  52.39886  13.06566    1  Hamburg  53.57532  10.01534
    6  Hamburg  53.57532  10.01534    1   Berlin  52.52437  13.41053
    7  Hamburg  53.57532  10.01534    1  Potsdam  52.39886  13.06566
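
    On pandas 1.2 or newer, the helper column is probably not needed, because `merge` accepts `how='cross'`. A minimal sketch, assuming the three cities from the output above are held in a frame called `cities` (a name chosen here for illustration):

```python
import pandas as pd

# Sample data taken from the output shown above.
cities = pd.DataFrame({
    'city': ['Berlin', 'Potsdam', 'Hamburg'],
    'lat': [52.52437, 52.39886, 53.57532],
    'lng': [13.41053, 13.06566, 10.01534],
})

# how='cross' (pandas >= 1.2) pairs every row with every row, no 'tmp' key required.
pairs = cities.merge(cities, how='cross', suffixes=('_x', '_y'))

# Drop the pairs that match a city with itself.
pairs = pairs[pairs['city_x'] != pairs['city_y']]
print(pairs)
```

    The result has the same six rows as the `tmp` approach, just without the extra column.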
    

    Then apply the haversine function row by row:

    df['dist'] = df.apply(lambda row: haversine(row['lng_x'], 
                                                row['lat_x'], 
                                                row['lng_y'], 
                                                row['lat_y']), axis=1)
    

    Filter by distance (here, less than 500 km):

    df = df[df.dist < 500]
    print (df)
        city_x     lat_x     lng_x  tmp   city_y     lat_y     lng_y        dist
    1   Berlin  52.52437  13.41053    1  Potsdam  52.39886  13.06566   27.215704
    2   Berlin  52.52437  13.41053    1  Hamburg  53.57532  10.01534  255.223782
    3  Potsdam  52.39886  13.06566    1   Berlin  52.52437  13.41053   27.215704
    5  Potsdam  52.39886  13.06566    1  Hamburg  53.57532  10.01534  242.464120
    6  Hamburg  53.57532  10.01534    1   Berlin  52.52437  13.41053  255.223782
    7  Hamburg  53.57532  10.01534    1  Potsdam  52.39886  13.06566  242.464120
    

    Finally, use groupby to collect the nearby cities into a list or to count them:

    df1 = df.groupby('city_x')['city_y'].apply(list)
    print (df1)
    city_x
    Berlin     [Potsdam, Hamburg]
    Hamburg     [Berlin, Potsdam]
    Potsdam     [Berlin, Hamburg]
    Name: city_y, dtype: object
    
    df2 = df.groupby('city_x')['city_y'].size()
    print (df2)
    city_x
    Berlin     2
    Hamburg    2
    Potsdam    2
    dtype: int64
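
    One thing to watch: `groupby` only reports cities that still have at least one row after filtering, so a city with no neighbour in range disappears from the result. The sketch below assumes a tighter, hypothetical radius `max_km = 100` and uses `reindex` to bring such cities back with a count of 0. The names `cities`, `pairs` and `counts` are illustrative, and the distance is computed with a vectorized form of the same haversine formula.

```python
import numpy as np
import pandas as pd

cities = pd.DataFrame({
    'city': ['Berlin', 'Potsdam', 'Hamburg'],
    'lat': [52.52437, 52.39886, 53.57532],
    'lng': [13.41053, 13.06566, 10.01534],
})
max_km = 100  # hypothetical radius, tighter than the 500 km used above

# Cross join (pandas >= 1.2), then drop self-pairs.
pairs = cities.merge(cities, how='cross', suffixes=('_x', '_y'))
pairs = pairs[pairs['city_x'] != pairs['city_y']]

# Vectorized haversine on the paired columns (6371 km Earth radius).
lat1, lng1, lat2, lng2 = (
    np.radians(pairs[col].to_numpy())
    for col in ['lat_x', 'lng_x', 'lat_y', 'lng_y']
)
a = (np.sin((lat2 - lat1) / 2) ** 2
     + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
pairs = pairs.assign(dist=2 * 6371 * np.arcsin(np.sqrt(a)))

# groupby only sees cities that still have rows; reindex restores the rest as 0.
counts = (
    pairs[pairs['dist'] < max_km]
    .groupby('city_x')
    .size()
    .reindex(cities['city'].tolist(), fill_value=0)
)
print(counts)
```

    With this radius, Berlin and Potsdam each have one neighbour and Hamburg has none.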
    

    It is also possible to use a vectorized numpy version of haversine:

    import numpy as np
    
    def haversine_np(lon1, lat1, lon2, lat2):
        """
        Calculate the great circle distance between two points
        on the earth (specified in decimal degrees)
    
        All args must be of equal length.    
    
        """
        lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    
        dlon = lon2 - lon1
        dlat = lat2 - lat1
    
        a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    
        c = 2 * np.arcsin(np.sqrt(a))
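        # 6367 km Earth radius; the scalar version above uses 6371,
        # which is why the distances below differ slightly
        km = 6367 * c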
        return km
    
    df['tmp'] = 1
    df = pd.merge(df,df,on='tmp')
    df = df[df.city_x != df.city_y]
    #print (df)
    
    df['dist'] = haversine_np(df['lng_x'],df['lat_x'],df['lng_y'],df['lat_y'])
    print (df)
        city_x     lat_x     lng_x  tmp   city_y     lat_y     lng_y        dist
    1   Berlin  52.52437  13.41053    1  Potsdam  52.39886  13.06566   27.198616
    2   Berlin  52.52437  13.41053    1  Hamburg  53.57532  10.01534  255.063541
    3  Potsdam  52.39886  13.06566    1   Berlin  52.52437  13.41053   27.198616
    5  Potsdam  52.39886  13.06566    1  Hamburg  53.57532  10.01534  242.311890
    6  Hamburg  53.57532  10.01534    1   Berlin  52.52437  13.41053  255.063541
    7  Hamburg  53.57532  10.01534    1  Potsdam  52.39886  13.06566  242.311890
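
    If the merged frame itself is not needed, the same formula can also be evaluated with numpy broadcasting, which yields the full distance matrix directly. This is only a sketch under the assumption that the number of places is small enough for an n × n array to fit in memory. The names `cities` and `matrix` are illustrative, and the radius 6367 is kept to match `haversine_np` above.

```python
import numpy as np
import pandas as pd

cities = pd.DataFrame({
    'city': ['Berlin', 'Potsdam', 'Hamburg'],
    'lat': [52.52437, 52.39886, 53.57532],
    'lng': [13.41053, 13.06566, 10.01534],
})

lat = np.radians(cities['lat'].to_numpy())
lng = np.radians(cities['lng'].to_numpy())

# A column vector minus a row vector broadcasts to all pairwise differences.
dlat = lat[:, None] - lat[None, :]
dlng = lng[:, None] - lng[None, :]

a = (np.sin(dlat / 2) ** 2
     + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlng / 2) ** 2)
dist_km = 2 * 6367 * np.arcsin(np.sqrt(a))

names = cities['city'].tolist()
matrix = pd.DataFrame(dist_km, index=names, columns=names)
print(matrix.round(2))
```

    The matrix is symmetric with zeros on the diagonal, so each pair appears twice, just as in the cross join.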
    
  • 2020-12-16 03:51

    UPDATE: I would suggest first building a distance DataFrame:

    import pandas as pd
    from scipy.spatial.distance import pdist
    from itertools import combinations
    
    # see definition of "haversine_np()" below     
    x = pd.DataFrame({'dist':pdist(df[['lat','lng']], haversine_np)},
                     index=pd.MultiIndex.from_tuples(tuple(combinations(df['city'], 2))))
    

    This efficiently produces a DataFrame of pairwise distances (without duplicates):

    In [106]: x
    Out[106]:
                           dist
    Berlin  Potsdam   27.198616
            Hamburg  255.063541
    Potsdam Hamburg  242.311890
    

    Old answer:

    Here is a slightly optimized version, which uses the scipy.spatial.distance.pdist function:

    import numpy as np
    import pandas as pd
    from scipy.spatial.distance import squareform, pdist
    
    # slightly modified version of http://stackoverflow.com/a/29546836/2901002
    def haversine_np(p1, p2):
        """
        Calculate the great circle distance between two points
        on the earth (specified in decimal degrees)
    
        p1 and p2 are (lat, lon) pairs.
    
        """
        lat1, lon1, lat2, lon2 = np.radians([p1[0], p1[1],
                                             p2[0], p2[1]])
        dlon = lon2 - lon1
        dlat = lat2 - lat1
    
        a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    
        c = 2 * np.arcsin(np.sqrt(a))
        km = 6367 * c
        return km
    
    x = pd.DataFrame(squareform(pdist(df[['lat','lng']], haversine_np)),
                     columns=df.city.unique(),
                     index=df.city.unique())
    

    This gives us:

    In [78]: x
    Out[78]:
                 Berlin     Potsdam     Hamburg
    Berlin     0.000000   27.198616  255.063541
    Potsdam   27.198616    0.000000  242.311890
    Hamburg  255.063541  242.311890    0.000000
    

    Let's count, for each city, the number of cities that are more than 30 km away:

    In [81]: x.groupby(level=0, as_index=False) \
        ...:  .apply(lambda c: c[c>30].notnull().sum(1)) \
        ...:  .reset_index(level=0, drop=True)
    Out[81]:
    Berlin     1
    Hamburg    2
    Potsdam    1
    dtype: int64
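
    The same count can likely be obtained without `groupby`, by summing a boolean mask over the distance matrix. A minimal sketch, with the matrix values copied from the output above. `farther_than_30` and `within_30` are illustrative names, and the second one counts neighbours inside the radius while masking out the diagonal, so a city is not counted as its own neighbour.

```python
import numpy as np
import pandas as pd

names = ['Berlin', 'Potsdam', 'Hamburg']
# Distance matrix in km, values copied from the output above.
x = pd.DataFrame(
    [[0.0, 27.198616, 255.063541],
     [27.198616, 0.0, 242.311890],
     [255.063541, 242.311890, 0.0]],
    index=names, columns=names,
)

# Number of cities more than 30 km away, per row.
farther_than_30 = (x > 30).sum(axis=1)

# Number of other cities within 30 km; the diagonal (self-distance 0) is excluded.
off_diagonal = ~np.eye(len(x), dtype=bool)
within_30 = ((x < 30) & off_diagonal).sum(axis=1)

print(farther_than_30)
print(within_30)
```

    Unlike the groupby version, the rows stay in the original order of the matrix instead of being sorted by city name.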
    