Fastest way to merge pandas dataframe on ranges

小蘑菇 2020-12-09 12:25

I have a dataframe A

    ip_address
0   13
1   5
2   20
3   11
.. ........

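and another dataframe B

    lowerbound_ip_address  upperbound_ip_address  country
0   0                      10                     Australia
1   11                     20                     China

I want to add a country column to A by finding, for each ip_address, the range in B that contains it.
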
3 Answers
  • 2020-12-09 12:48

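    Try pd.merge_asof. It requires both frames to be sorted on the join key, so sort A first. Here df is dataframe A and df1 is dataframe B; the call returns, for each range in B, the first address in A strictly above its lower bound.

    df = df.sort_values('ip_address')  # merge_asof raises an error on unsorted keys
    df['lowerbound_ip_address'] = df['ip_address']
    pd.merge_asof(df1, df, on='lowerbound_ip_address', direction='forward', allow_exact_matches=False)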
    Out[811]: 
       lowerbound_ip_address  upperbound_ip_address    country  ip_address
    0                      0                     10  Australia           5
    1                     11                     20      China          13
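
To get one country per address instead of one address per range, the merge can be turned around. The following is only a sketch under assumptions: the frame names dfa and dfb follow the other answers, the ranges in B do not overlap, and the address 25 is made up here to show what happens when no range matches.

```python
import pandas as pd

dfa = pd.DataFrame({"ip_address": [13, 5, 20, 11, 25]})  # 25 is a hypothetical out-of-range address
dfb = pd.DataFrame({
    "lowerbound_ip_address": [0, 11],
    "upperbound_ip_address": [10, 20],
    "country": ["Australia", "China"],
})

# merge_asof needs sorted keys; reset_index() keeps A's original row order in an "index" column
left = dfa.reset_index().sort_values("ip_address")
right = dfb.sort_values("lowerbound_ip_address")

# Default direction="backward": pick the last range whose lower bound is <= ip_address
matched = pd.merge_asof(left, right,
                        left_on="ip_address", right_on="lowerbound_ip_address")

# Blank out the country where the address lies beyond the matched range's upper bound
in_range = matched["ip_address"] <= matched["upperbound_ip_address"]
matched["country"] = matched["country"].where(in_range)

# Restore A's original row order
result = matched.set_index("index").sort_index()[["ip_address", "country"]]
print(result)
```

The upper-bound check matters because merge_asof only compares against the lower bound; without it, 25 would be labelled China.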
    
  • 2020-12-09 12:49

    Use pd.IntervalIndex

    In [2503]: s = pd.IntervalIndex.from_arrays(dfb.lowerbound_ip_address,
                                                dfb.upperbound_ip_address,
                                                closed='both')
    
    In [2504]: dfa.assign(country=dfb.set_index(s).loc[dfa.ip_address].country.values)
    Out[2504]:
       ip_address    country
    0          13      China
    1           5  Australia
    2          20      China
    3          11      China
    

    Details

    In [2505]: s
    Out[2505]:
    IntervalIndex([[0, 10], [11, 20]]
                  closed='both',
                  dtype='interval[int64]')
    
    In [2507]: dfb.set_index(s)
    Out[2507]:
              lowerbound_ip_address  upperbound_ip_address    country
    [0, 10]                       0                     10  Australia
    [11, 20]                     11                     20      China
    
    In [2506]: dfb.set_index(s).loc[dfa.ip_address]
    Out[2506]:
              lowerbound_ip_address  upperbound_ip_address    country
    [11, 20]                     11                     20      China
    [0, 10]                       0                     10  Australia
    [11, 20]                     11                     20      China
    [11, 20]                     11                     20      China
    

    Setup

    In [2508]: dfa
    Out[2508]:
       ip_address
    0          13
    1           5
    2          20
    3          11
    
    In [2509]: dfb
    Out[2509]:
       lowerbound_ip_address  upperbound_ip_address    country
    0                      0                     10  Australia
    1                     11                     20      China
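
If some addresses may fall outside every range, a .loc lookup on the IntervalIndex is, as far as I know, expected to raise a KeyError. A possible variant, sketched here under the assumption that the ranges do not overlap, uses IntervalIndex.get_indexer, which returns -1 for unmatched values. The address 25 is made up for illustration.

```python
import pandas as pd

dfa = pd.DataFrame({"ip_address": [13, 5, 20, 11, 25]})  # 25 is a hypothetical out-of-range address
dfb = pd.DataFrame({
    "lowerbound_ip_address": [0, 11],
    "upperbound_ip_address": [10, 20],
    "country": ["Australia", "China"],
})

# Closed on both sides, so 10, 11 and 20 all belong to a range
intervals = pd.IntervalIndex.from_arrays(
    dfb["lowerbound_ip_address"], dfb["upperbound_ip_address"], closed="both"
)

# Position of the containing interval for each address, or -1 if there is none
pos = intervals.get_indexer(dfa["ip_address"])

# Look up countries by position, then mask the rows that had no match
countries = pd.Series(dfb["country"].to_numpy()[pos], index=dfa.index)
result = dfa.assign(country=countries.where(pos >= 0))
print(result)
```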
    
  • 2020-12-09 12:54

    IntervalIndex is available as of pandas 0.20.0, and the solution by @JohnGalt using it is excellent.

    On earlier versions, the following works instead: expand every range in B into one row per individual address, then join A against the expanded table.

    import pandas as pd

    # Build one row per individual address in every range, indexed by address
    df_ip = pd.concat([pd.DataFrame(
        {'ip_address': range(row['lowerbound_ip_address'], row['upperbound_ip_address'] + 1),
         'country': row['country']})
        for _, row in dfb.iterrows()]).set_index('ip_address')
    >>> dfa.set_index('ip_address').join(df_ip)
                  country
    ip_address           
    13              China
    5           Australia
    20              China
    11              China
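
The same expansion idea can also be written as a plain dictionary lookup, which avoids building one small DataFrame per range. This is a sketch only: the name lookup is introduced here, the address 25 is made up to show an unmatched value, and memory still grows with the total width of all ranges, so it is likely only practical when the ranges are small.

```python
import pandas as pd

dfa = pd.DataFrame({"ip_address": [13, 5, 20, 11, 25]})  # 25 is a hypothetical out-of-range address
dfb = pd.DataFrame({
    "lowerbound_ip_address": [0, 11],
    "upperbound_ip_address": [10, 20],
    "country": ["Australia", "China"],
})

# Expand every range into individual addresses: {0: 'Australia', ..., 20: 'China'}
lookup = {
    ip: row.country
    for row in dfb.itertuples(index=False)
    for ip in range(row.lowerbound_ip_address, row.upperbound_ip_address + 1)
}

# Series.map leaves NaN for addresses that are not in the dictionary
result = dfa.assign(country=dfa["ip_address"].map(lookup))
print(result)
```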
    