Merging dataframes based on date range

前端未结

关注

 2  1735

醉梦人生

I have two pandas dataframes: one (df1) with three columns (StartDate, EndDate, and ID) and a second (df2) w

相关标签:

2条回答

攒了一身酷

2020-12-10 00:03

Just to provide an alternative way using np.piecewise. The performance is even faster than np.searchedsort.

import pandas as pd
import numpy as np

# data
# ====================================
df1 = pd.DataFrame({'StartDate': pd.date_range('2010-01-01', periods=9, freq='5D'), 'EndDate': pd.date_range('2010-01-04', periods=9, freq='5D'), 'ID': np.arange(1, 10, 1)})

df2 = pd.DataFrame(dict(values=np.random.randn(50), date_time=pd.date_range('2010-01-01', periods=50, freq='D')))

df1.StartDate

Out[139]: 
0   2010-01-01
1   2010-01-06
2   2010-01-11
3   2010-01-16
4   2010-01-21
5   2010-01-26
6   2010-01-31
7   2010-02-05
8   2010-02-10
Name: StartDate, dtype: datetime64[ns]

df2.date_time

Out[140]: 
0    2010-01-01
1    2010-01-02
2    2010-01-03
3    2010-01-04
4    2010-01-05
5    2010-01-06
6    2010-01-07
7    2010-01-08
8    2010-01-09
9    2010-01-10
        ...    
40   2010-02-10
41   2010-02-11
42   2010-02-12
43   2010-02-13
44   2010-02-14
45   2010-02-15
46   2010-02-16
47   2010-02-17
48   2010-02-18
49   2010-02-19
Name: date_time, dtype: datetime64[ns]


df2['ID_matched'] = np.piecewise(np.zeros(len(df2)), [(df2.date_time.values >= start_date)&(df2.date_time.values <= end_date) for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)], df1.ID.values)


Out[143]: 
    date_time  values  ID_matched
0  2010-01-01 -0.2240           1
1  2010-01-02 -0.4202           1
2  2010-01-03  0.9998           1
3  2010-01-04  0.4310           1
4  2010-01-05 -0.6509           0
5  2010-01-06 -1.4987           2
6  2010-01-07 -1.2306           2
7  2010-01-08  0.1940           2
8  2010-01-09 -0.9984           2
9  2010-01-10 -0.3676           0
..        ...     ...         ...
40 2010-02-10  0.5242           9
41 2010-02-11  0.3451           9
42 2010-02-12  0.7244           9
43 2010-02-13 -2.0404           9
44 2010-02-14 -1.0798           0
45 2010-02-15 -0.6934           0
46 2010-02-16 -2.3380           0
47 2010-02-17  1.6623           0
48 2010-02-18 -0.2754           0
49 2010-02-19 -0.7466           0

[50 rows x 3 columns]

%timeit df2['ID_matched'] = np.piecewise(np.zeros(len(df2)), [(df2.date_time.values >= start_date)&(df2.date_time.values <= end_date) for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)], df1.ID.values)
1000 loops, best of 3: 466 µs per loop

0 讨论(0)

庸人自扰

2020-12-10 00:07

Minor correction to @JianxunLi answer. Bit too involved for a comment.

This uses the len(funclist) == len(condlist) + 1 property of piecewise to assign a default value for when there is no match. Otherwise the default no-match value is zero, which can cause problems...

### Data / inits
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'StartDate': pd.date_range('2010-01-01', periods=9, freq='5D'), 'EndDate': pd.date_range('2010-01-04', periods=9, freq='5D'), 'ID': np.arange(1, 10, 1)})
df2 = pd.DataFrame(dict(values=np.random.randn(50), date_time=pd.date_range('2010-01-01', periods=50, freq='D')))

### Processing
valIfNoMatch = np.nan
df2['ID_matched'] = np.piecewise(np.zeros(len(df2)),\
                                     [(df2.date_time.values >= start_date)&(df2.date_time.values < end_date) for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)],\
                                     np.append(df1.ID.values, valIfNoMatch))

PS. Also corrected the typo testing both >= & <=; a timestamp on an exact boundary between intervals would return true for two different intervals, which breaks a key assumption of the method.

0 讨论(0)