I want to merge two dataframes on three columns: email, subject and timestamp. The timestamps between the dataframes differ and I therefore need to identify the closest mat
Notice that if you merge df1 and df2 on email and subject, then the result
has all the possible relevant timestamp pairings:
In [108]: result = pd.merge(df1, df2, how='left', on=['email','subject'], suffixes=['', '_y']); result
Out[108]:
              timestamp        email   subject          timestamp_y  clicks  var1
0   2016-07-01 10:17:00  a@gmail.com  subject3  2016-07-01 10:17:39       1     7
1   2016-07-01 10:17:00  a@gmail.com  subject3  2016-07-01 14:46:01       1     2
2   2016-07-01 02:01:02  a@gmail.com   welcome  2016-07-01 02:01:14       1     1
3   2016-07-01 14:45:04  a@gmail.com  subject3  2016-07-01 10:17:39       1     7
4   2016-07-01 14:45:04  a@gmail.com  subject3  2016-07-01 14:46:01       1     2
5   2016-07-01 08:14:02  a@gmail.com  subject2  2016-07-01 08:15:48       2     2
6   2016-07-01 16:26:35  a@gmail.com  subject4  2016-07-01 16:27:28       1     2
7   2016-07-01 10:17:00  b@gmail.com  subject3  2016-07-01 10:17:05       0     0
8   2016-07-01 10:17:00  b@gmail.com  subject3  2016-07-01 14:45:05       0     0
9   2016-07-01 02:01:02  b@gmail.com   welcome  2016-07-01 02:01:03       0     0
10  2016-07-01 14:45:04  b@gmail.com  subject3  2016-07-01 10:17:05       0     0
11  2016-07-01 14:45:04  b@gmail.com  subject3  2016-07-01 14:45:05       0     0
12  2016-07-01 08:14:02  b@gmail.com  subject2  2016-07-01 08:16:00       0     0
13  2016-07-01 16:26:35  b@gmail.com  subject4  2016-07-01 17:00:00       0     0
You could now take the absolute value of the difference in timestamps for each row:
result['diff'] = (result['timestamp_y'] - result['timestamp']).abs()
and then use
idx = result.groupby(['timestamp','email','subject'])['diff'].idxmin()
result = result.loc[idx]
to find the rows with the minimum difference for each group based on ['timestamp','email','subject'].
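One caveat: with how='left', a row of df1 that has no partner in df2 ends up with a missing diff. As far as I know, how idxmin treats a group whose values are all missing differs between pandas versions, so it may be safer to set those rows aside before grouping. Below is a minimal sketch of that idea. The toy frames left and right and the names matched, unmatched and out are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Hypothetical toy frames: the second row of `left` has no partner in `right`.
left = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-07-01 10:17:00", "2016-07-01 09:00:00"]),
    "email": ["a@gmail.com", "c@gmail.com"],
    "subject": ["subject3", "welcome"],
})
right = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-07-01 10:17:39", "2016-07-01 14:46:01"]),
    "email": ["a@gmail.com", "a@gmail.com"],
    "subject": ["subject3", "subject3"],
    "clicks": [1, 1],
    "var1": [7, 2],
})

merged = pd.merge(left, right, how="left", on=["email", "subject"], suffixes=("", "_y"))
merged["diff"] = (merged["timestamp_y"] - merged["timestamp"]).abs()

# Rows that have at least one candidate partner: keep the smallest gap per group.
matched = merged.dropna(subset=["diff"])
idx = matched.groupby(["timestamp", "email", "subject"])["diff"].idxmin()
closest = matched.loc[idx]

# Rows without any partner: keep them, with missing values in the df2 columns.
unmatched = merged[merged["diff"].isna()]

out = pd.concat([closest, unmatched]).sort_index()
print(out)
```

Here the a@gmail.com row keeps its closest partner, and the c@gmail.com row survives with missing clicks and var1 instead of being passed to idxmin.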
Putting it all together:
import pandas as pd
from io import StringIO  # pandas.io.parsers no longer exposes StringIO
a = """timestamp,email,subject
2016-07-01 10:17:00,a@gmail.com,subject3
2016-07-01 02:01:02,a@gmail.com,welcome
2016-07-01 14:45:04,a@gmail.com,subject3
2016-07-01 08:14:02,a@gmail.com,subject2
2016-07-01 16:26:35,a@gmail.com,subject4
2016-07-01 10:17:00,b@gmail.com,subject3
2016-07-01 02:01:02,b@gmail.com,welcome
2016-07-01 14:45:04,b@gmail.com,subject3
2016-07-01 08:14:02,b@gmail.com,subject2
2016-07-01 16:26:35,b@gmail.com,subject4
"""
b = """timestamp,email,subject,clicks,var1
2016-07-01 02:01:14,a@gmail.com,welcome,1,1
2016-07-01 08:15:48,a@gmail.com,subject2,2,2
2016-07-01 10:17:39,a@gmail.com,subject3,1,7
2016-07-01 14:46:01,a@gmail.com,subject3,1,2
2016-07-01 16:27:28,a@gmail.com,subject4,1,2
2016-07-01 10:17:05,b@gmail.com,subject3,0,0
2016-07-01 02:01:03,b@gmail.com,welcome,0,0
2016-07-01 14:45:05,b@gmail.com,subject3,0,0
2016-07-01 08:16:00,b@gmail.com,subject2,0,0
2016-07-01 17:00:00,b@gmail.com,subject4,0,0
"""
df1 = pd.read_csv(StringIO(a), parse_dates=['timestamp'])
df2 = pd.read_csv(StringIO(b), parse_dates=['timestamp'])
result = pd.merge(df1, df2, how='left', on=['email','subject'], suffixes=['', '_y'])
result['diff'] = (result['timestamp_y'] - result['timestamp']).abs()
idx = result.groupby(['timestamp','email','subject'])['diff'].idxmin()
result = result.loc[idx].drop(['timestamp_y','diff'], axis=1)
result = result.sort_index()
print(result)
yields
              timestamp        email   subject  clicks  var1
0   2016-07-01 10:17:00  a@gmail.com  subject3       1     7
2   2016-07-01 02:01:02  a@gmail.com   welcome       1     1
4   2016-07-01 14:45:04  a@gmail.com  subject3       1     2
5   2016-07-01 08:14:02  a@gmail.com  subject2       2     2
6   2016-07-01 16:26:35  a@gmail.com  subject4       1     2
7   2016-07-01 10:17:00  b@gmail.com  subject3       0     0
9   2016-07-01 02:01:02  b@gmail.com   welcome       0     0
11  2016-07-01 14:45:04  b@gmail.com  subject3       0     0
12  2016-07-01 08:14:02  b@gmail.com  subject2       0     0
13  2016-07-01 16:26:35  b@gmail.com  subject4       0     0
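As an alternative to the merge-then-idxmin approach above, newer pandas versions provide pd.merge_asof, which can match on the nearest key directly. This is a different technique from the one shown above, sketched here on a shortened, hand-copied subset of the example data. It assumes a pandas version that supports the direction parameter and that both frames are sorted by timestamp. The names left, right and nearest are made up for illustration.

```python
import pandas as pd

# A shortened subset of the example data (a@gmail.com rows only).
df1 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-07-01 10:17:00", "2016-07-01 02:01:02", "2016-07-01 14:45:04"]),
    "email": ["a@gmail.com"] * 3,
    "subject": ["subject3", "welcome", "subject3"],
})
df2 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-07-01 02:01:14", "2016-07-01 10:17:39", "2016-07-01 14:46:01"]),
    "email": ["a@gmail.com"] * 3,
    "subject": ["welcome", "subject3", "subject3"],
    "clicks": [1, 1, 1],
    "var1": [1, 7, 2],
})

# merge_asof requires both frames to be sorted by the `on` key.
left = df1.sort_values("timestamp")
right = df2.sort_values("timestamp")

# Exact match on email and subject, nearest match on timestamp.
nearest = pd.merge_asof(
    left, right,
    on="timestamp",
    by=["email", "subject"],
    direction="nearest",
)
print(nearest)
```

The result keeps the timestamps from df1 and comes back in sorted-timestamp order with a fresh index, so it will not line up row for row with the output above. merge_asof also accepts a tolerance argument (for example a pd.Timedelta) if matches that are too far apart should be left empty.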