Merging dataframes based on date range

醉梦人生 2020-12-09 23:32

I have two pandas dataframes: one (df1) with three columns (StartDate, EndDate, and ID) and a second (df2) w

2 Answers
  • 2020-12-10 00:03

    Here is an alternative approach using np.piecewise. It performs even faster than np.searchsorted.

    import pandas as pd
    import numpy as np
    
    # data
    # ====================================
    df1 = pd.DataFrame({'StartDate': pd.date_range('2010-01-01', periods=9, freq='5D'), 'EndDate': pd.date_range('2010-01-04', periods=9, freq='5D'), 'ID': np.arange(1, 10, 1)})
    
    df2 = pd.DataFrame(dict(values=np.random.randn(50), date_time=pd.date_range('2010-01-01', periods=50, freq='D')))
    
    df1.StartDate
    
    Out[139]: 
    0   2010-01-01
    1   2010-01-06
    2   2010-01-11
    3   2010-01-16
    4   2010-01-21
    5   2010-01-26
    6   2010-01-31
    7   2010-02-05
    8   2010-02-10
    Name: StartDate, dtype: datetime64[ns]
    
    df2.date_time
    
    Out[140]: 
    0    2010-01-01
    1    2010-01-02
    2    2010-01-03
    3    2010-01-04
    4    2010-01-05
    5    2010-01-06
    6    2010-01-07
    7    2010-01-08
    8    2010-01-09
    9    2010-01-10
            ...    
    40   2010-02-10
    41   2010-02-11
    42   2010-02-12
    43   2010-02-13
    44   2010-02-14
    45   2010-02-15
    46   2010-02-16
    47   2010-02-17
    48   2010-02-18
    49   2010-02-19
    Name: date_time, dtype: datetime64[ns]
    
    
    df2['ID_matched'] = np.piecewise(np.zeros(len(df2)), [(df2.date_time.values >= start_date)&(df2.date_time.values <= end_date) for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)], df1.ID.values)
    
    
    Out[143]: 
        date_time  values  ID_matched
    0  2010-01-01 -0.2240           1
    1  2010-01-02 -0.4202           1
    2  2010-01-03  0.9998           1
    3  2010-01-04  0.4310           1
    4  2010-01-05 -0.6509           0
    5  2010-01-06 -1.4987           2
    6  2010-01-07 -1.2306           2
    7  2010-01-08  0.1940           2
    8  2010-01-09 -0.9984           2
    9  2010-01-10 -0.3676           0
    ..        ...     ...         ...
    40 2010-02-10  0.5242           9
    41 2010-02-11  0.3451           9
    42 2010-02-12  0.7244           9
    43 2010-02-13 -2.0404           9
    44 2010-02-14 -1.0798           0
    45 2010-02-15 -0.6934           0
    46 2010-02-16 -2.3380           0
    47 2010-02-17  1.6623           0
    48 2010-02-18 -0.2754           0
    49 2010-02-19 -0.7466           0
    
    [50 rows x 3 columns]
    
    %timeit df2['ID_matched'] = np.piecewise(np.zeros(len(df2)), [(df2.date_time.values >= start_date)&(df2.date_time.values <= end_date) for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)], df1.ID.values)
    1000 loops, best of 3: 466 µs per loop
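
    For comparison, here is a minimal sketch of what an np.searchsorted version might look like. It is not the code that was timed above: it assumes the intervals in df1 are sorted by StartDate and do not overlap, and the names starts, ends, pos, and safe_pos are chosen only for illustration. Unmatched rows get NaN rather than 0.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'StartDate': pd.date_range('2010-01-01', periods=9, freq='5D'),
    'EndDate': pd.date_range('2010-01-04', periods=9, freq='5D'),
    'ID': np.arange(1, 10),
})
df2 = pd.DataFrame({
    'date_time': pd.date_range('2010-01-01', periods=50, freq='D'),
})

starts = df1['StartDate'].to_numpy()
ends = df1['EndDate'].to_numpy()
dates = df2['date_time'].to_numpy()

# Position of the last interval whose start is <= each date (-1 if none).
pos = np.searchsorted(starts, dates, side='right') - 1

# Clip so that -1 can still be used as an index; such rows are masked out below.
safe_pos = np.clip(pos, 0, None)

# A candidate counts as a match only if the date is also <= that interval's end.
matched = (pos >= 0) & (dates <= ends[safe_pos])

# NaN marks "no match", so the resulting column is float.
df2['ID_matched'] = np.where(matched, df1['ID'].to_numpy()[safe_pos], np.nan)
```

    This does one binary search per row instead of one boolean mask per interval, so it should scale better when df1 has many rows, but it relies on the sorted, non-overlapping assumption stated above.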
    
  • 2020-12-10 00:07

    A minor correction to @JianxunLi's answer. It is a bit too involved for a comment.

    This uses the len(funclist) == len(condlist) + 1 property of piecewise to assign a default value for when there is no match. Otherwise the default no-match value is zero, which can cause problems, since a 0 in the result cannot be told apart from a genuine value of 0.
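
    To see that property in isolation, here is a small sketch on toy data (the array x and the two conditions are made up for illustration and are not part of the original answer):

```python
import numpy as np

x = np.array([1.0, 5.0, 9.0])
conds = [x < 3, x > 7]  # 5.0 satisfies neither condition

# As many values as conditions: unmatched entries are left at 0.
without_default = np.piecewise(x, conds, [10, 20])
# -> [10., 0., 20.]

# One extra value: it is used wherever no condition holds.
with_default = np.piecewise(x, conds, [10, 20, np.nan])
# -> [10., nan, 20.]
```

    Because the output takes the dtype of the first argument, a float array is needed there for a NaN default to be representable.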

    ### Data / inits
    import pandas as pd
    import numpy as np
    
    df1 = pd.DataFrame({'StartDate': pd.date_range('2010-01-01', periods=9, freq='5D'), 'EndDate': pd.date_range('2010-01-04', periods=9, freq='5D'), 'ID': np.arange(1, 10, 1)})
    df2 = pd.DataFrame(dict(values=np.random.randn(50), date_time=pd.date_range('2010-01-01', periods=50, freq='D')))
    
    ### Processing
    valIfNoMatch = np.nan
    df2['ID_matched'] = np.piecewise(
        np.zeros(len(df2)),
        [(df2.date_time.values >= start_date) & (df2.date_time.values < end_date)
         for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)],
        np.append(df1.ID.values, valIfNoMatch))
    

    P.S. I also corrected the comparison that tested both >= and <=: a timestamp on an exact boundary between two intervals would match both of them, which breaks a key assumption of the method.
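
    The boundary issue can be checked directly by counting how many intervals each timestamp falls into. The sketch below uses two hypothetical intervals that share the boundary 2010-01-05; they are not the df1 from above, whose intervals do not touch.

```python
import numpy as np
import pandas as pd

# Two made-up intervals that meet at 2010-01-05.
starts = pd.to_datetime(['2010-01-01', '2010-01-05']).to_numpy()
ends = pd.to_datetime(['2010-01-05', '2010-01-09']).to_numpy()
dates = pd.to_datetime(['2010-01-03', '2010-01-05', '2010-01-07']).to_numpy()

# One boolean mask per interval, first closed on both ends, then half-open.
closed = [(dates >= s) & (dates <= e) for s, e in zip(starts, ends)]
half_open = [(dates >= s) & (dates < e) for s, e in zip(starts, ends)]

# Number of intervals each date belongs to.
hits_closed = np.sum(closed, axis=0)        # -> [1, 2, 1]
hits_half_open = np.sum(half_open, axis=0)  # -> [1, 1, 1]
```

    The trade-off is that with < a timestamp equal to EndDate no longer matches its own interval. For the sample df1 above, whose intervals do not touch, 2010-01-04 would therefore get the no-match value.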
