Pandas merge_asof() giving duplicate matches

Question

I have two dataframes with datetimes that I want to merge. Because some of the timestamps may not be exactly the same on the dataframes, I think it's best to use pandas merge_asof() function.

I want to join timestamps on the 'nearest' value but within a given tolerance (e.g. +/- 5 minutes). However, it seems that merge_asof() matches the timestamp in the second dataframe with every timestamp of the first dataframe that lies within the tolerance. This is better explained with the example below.

import pandas as pd

df1 = pd.date_range("2019-01-01 00:00:00", "2019-01-01 00:04:00", freq='20s')
df1 = pd.DataFrame(df1, columns=['time'])

df2 = pd.DataFrame(["2019-01-01 00:02:00"], columns=['time'])
df2['time'] = pd.to_datetime(df2['time'])
df2['df2_col'] = 'df2'

merged_df = pd.merge_asof(df1, df2, left_on='time', right_on='time',
              tolerance=pd.Timedelta('40s'),
              allow_exact_matches=True,
              direction='nearest')

print(merged_df)

Actual output:

                  time df2_col
0  2019-01-01 00:00:00     NaN
1  2019-01-01 00:00:20     NaN
2  2019-01-01 00:00:40     NaN
3  2019-01-01 00:01:00     NaN
4  2019-01-01 00:01:20     df2
5  2019-01-01 00:01:40     df2
6  2019-01-01 00:02:00     df2
7  2019-01-01 00:02:20     df2
8  2019-01-01 00:02:40     df2
9  2019-01-01 00:03:00     NaN
10 2019-01-01 00:03:20     NaN
11 2019-01-01 00:03:40     NaN
12 2019-01-01 00:04:00     NaN

Expected output:

                  time df2_col
0  2019-01-01 00:00:00     NaN
1  2019-01-01 00:00:20     NaN
2  2019-01-01 00:00:40     NaN
3  2019-01-01 00:01:00     NaN
4  2019-01-01 00:01:20     NaN
5  2019-01-01 00:01:40     NaN
6  2019-01-01 00:02:00     df2
7  2019-01-01 00:02:20     NaN
8  2019-01-01 00:02:40     NaN
9  2019-01-01 00:03:00     NaN
10 2019-01-01 00:03:20     NaN
11 2019-01-01 00:03:40     NaN
12 2019-01-01 00:04:00     NaN

Is this the expected behavior? How can I manage to get the expected result?

Answer 1:

The actual output is the expected behavior: merge_asof(left, right) finds for every row in left the nearest row in right (within the tolerance limits). What you want is slightly different: you want to find the one row in left that is nearest to right. I'm afraid there's no built-in function for this in pandas.
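The asymmetry can be seen by swapping the arguments. The following is a minimal sketch that rebuilds the question's data; the `opts` dict and the `df1_time` column are names introduced here for illustration only, to carry over the matched left-hand timestamp.

```python
import pandas as pd

# Same data as in the question.
df1 = pd.DataFrame(
    {"time": pd.date_range("2019-01-01 00:00:00", "2019-01-01 00:04:00", freq="20s")}
)
df2 = pd.DataFrame(
    {"time": pd.to_datetime(["2019-01-01 00:02:00"]), "df2_col": ["df2"]}
)

# Shared merge options (hypothetical helper, not part of pandas).
opts = dict(on="time", tolerance=pd.Timedelta("40s"), direction="nearest")

# One output row per row of df1; every df1 row within 40s of df2 gets a match.
forward = pd.merge_asof(df1, df2, **opts)

# One output row per row of df2; it is matched to the single nearest df1 row.
# df1_time is a copy of df1's timestamp so the matched row stays visible.
reverse = pd.merge_asof(df2, df1.assign(df1_time=df1["time"]), **opts)

print(len(forward), forward["df2_col"].notna().sum())  # 13 5
print(len(reverse), reverse["df1_time"].iloc[0])       # 1 2019-01-01 00:02:00
```

The forward merge fills five rows because each of them independently finds the same df2 row as its nearest neighbour, whereas the reverse merge yields exactly one match.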

To achieve what you want, you could do a reverse merge_asof(right, left) and then merge the result with left. To identify the row you need in the result of the reverse merge_asof, we reset the index of df1 first and use that information in the second merge:

x = pd.merge_asof(df2, df1.reset_index(), left_on='time', right_on='time',
              tolerance=pd.Timedelta('40s'),
              allow_exact_matches=True,
              direction='nearest')

merged_df = df1.merge(x[['df2_col','index']], how='left', left_index=True, right_on='index').set_index('index')

Result:

                     time df2_col
index                            
0     2019-01-01 00:00:00     NaN
1     2019-01-01 00:00:20     NaN
2     2019-01-01 00:00:40     NaN
3     2019-01-01 00:01:00     NaN
4     2019-01-01 00:01:20     NaN
5     2019-01-01 00:01:40     NaN
6     2019-01-01 00:02:00     df2
7     2019-01-01 00:02:20     NaN
8     2019-01-01 00:02:40     NaN
9     2019-01-01 00:03:00     NaN
10    2019-01-01 00:03:20     NaN
11    2019-01-01 00:03:40     NaN
12    2019-01-01 00:04:00     NaN

Caveat: In our example, df1 has an unnamed index. Resetting this index turns it into a column with the default name 'index', which we use in the second merge. If, however, df1 already has a column named 'index', then the new column will be named 'level_0' and we'll have to use this name in the second merge instead of 'index'.
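One way to avoid depending on the default column name is to add an explicitly named row-number column before the reverse merge. This is only a sketch under assumptions: the column name `_row` is hypothetical and must not clash with an existing column, both frames are assumed to be sorted on 'time', and each row of df2 is assumed to map to a different row of df1 (otherwise the final merge would duplicate rows).

```python
import pandas as pd

# Same data as in the question.
df1 = pd.DataFrame(
    {"time": pd.date_range("2019-01-01 00:00:00", "2019-01-01 00:04:00", freq="20s")}
)
df2 = pd.DataFrame(
    {"time": pd.to_datetime(["2019-01-01 00:02:00"]), "df2_col": ["df2"]}
)

# Tag every df1 row with its position under a name we control ("_row" is hypothetical).
left = df1.assign(_row=range(len(df1)))

# Reverse merge: for each df2 row, find the nearest df1 row within the tolerance.
x = pd.merge_asof(
    df2, left, on="time",
    tolerance=pd.Timedelta("40s"),
    direction="nearest",
)

# df2 rows without a match get NaN in "_row"; drop them and restore the integer dtype.
x = x.dropna(subset=["_row"]).astype({"_row": "int64"})

# Attach the df2 columns to the matched df1 rows only, then remove the helper column.
merged_df = (
    left.merge(x[["_row", "df2_col"]], on="_row", how="left")
        .drop(columns="_row")
)

print(merged_df)
```

With the question's data, only the row at 00:02:00 ends up with 'df2' in df2_col, and the original index of df1 is left as it was.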

Source: https://stackoverflow.com/questions/57919854/pandas-merge-asof-giving-duplicate-matches

Tags

pandas

merge