pandas merge dataframes on closest timestamp

匿名 (未验证) 提交于 2019-12-03 02:29:01

问题:

I want to merge two dataframes on three columns: email, subject and timestamp. The timestamps between the dataframes differ and I therefore need to identify the closest matching timestamp for a group of email & subject.

Below is a reproducible example using a function for closest match suggested for this question.

import numpy as np import pandas as pd from pandas.io.parsers import StringIO  def find_closest_date(timepoint, time_series, add_time_delta_column=True):    # takes a pd.Timestamp() instance and a pd.Series with dates in it    # calcs the delta between `timepoint` and each date in `time_series`    # returns the closest date and optionally the number of days in its time delta    deltas = np.abs(time_series - timepoint)    idx_closest_date = np.argmin(deltas)    res = {"closest_date": time_series.ix[idx_closest_date]}    idx = ['closest_date']    if add_time_delta_column:       res["closest_delta"] = deltas[idx_closest_date]       idx.append('closest_delta')    return pd.Series(res, index=idx)   a = """timestamp,email,subject 2016-07-01 10:17:00,a@gmail.com,subject3 2016-07-01 02:01:02,a@gmail.com,welcome 2016-07-01 14:45:04,a@gmail.com,subject3 2016-07-01 08:14:02,a@gmail.com,subject2 2016-07-01 16:26:35,a@gmail.com,subject4 2016-07-01 10:17:00,b@gmail.com,subject3 2016-07-01 02:01:02,b@gmail.com,welcome 2016-07-01 14:45:04,b@gmail.com,subject3 2016-07-01 08:14:02,b@gmail.com,subject2 2016-07-01 16:26:35,b@gmail.com,subject4 """  b = """timestamp,email,subject,clicks,var1 2016-07-01 02:01:14,a@gmail.com,welcome,1,1 2016-07-01 08:15:48,a@gmail.com,subject2,2,2 2016-07-01 10:17:39,a@gmail.com,subject3,1,7 2016-07-01 14:46:01,a@gmail.com,subject3,1,2 2016-07-01 16:27:28,a@gmail.com,subject4,1,2 2016-07-01 10:17:05,b@gmail.com,subject3,0,0 2016-07-01 02:01:03,b@gmail.com,welcome,0,0 2016-07-01 14:45:05,b@gmail.com,subject3,0,0 2016-07-01 08:16:00,b@gmail.com,subject2,0,0 2016-07-01 17:00:00,b@gmail.com,subject4,0,0 """ 

Notice that for a@gmail.com the closest matched timestamp is 10:17:39, whereas for b@gmail.com the closest match is 10:17:05.

a = """timestamp,email,subject 2016-07-01 10:17:00,a@gmail.com,subject3 2016-07-01 10:17:00,b@gmail.com,subject3 """  b = """timestamp,email,subject,clicks,var1 2016-07-01 10:17:39,a@gmail.com,subject3,1,7 2016-07-01 10:17:05,b@gmail.com,subject3,0,0 """ df1 = pd.read_csv(StringIO(a), parse_dates=['timestamp']) df2 = pd.read_csv(StringIO(b), parse_dates=['timestamp'])  df1[['closest', 'time_bt_x_and_y']] = df1.timestamp.apply(find_closest_date, args=[df2.timestamp]) df1  df3 = pd.merge(df1, df2, left_on=['email','subject','closest'], right_on=['email','subject','timestamp'],how='left')  df3 timestamp_x        email   subject             closest  time_bt_x_and_y         timestamp_y  clicks  var1   2016-07-01 10:17:00  a@gmail.com  subject3 2016-07-01 10:17:05         00:00:05                 NaT     NaN   NaN   2016-07-01 02:01:02  a@gmail.com   welcome 2016-07-01 02:01:03         00:00:01                 NaT     NaN   NaN   2016-07-01 14:45:04  a@gmail.com  subject3 2016-07-01 14:45:05         00:00:01                 NaT     NaN   NaN   2016-07-01 08:14:02  a@gmail.com  subject2 2016-07-01 08:15:48         00:01:46 2016-07-01 08:15:48     2.0   2.0   2016-07-01 16:26:35  a@gmail.com  subject4 2016-07-01 16:27:28         00:00:53 2016-07-01 16:27:28     1.0   2.0   2016-07-01 10:17:00  b@gmail.com  subject3 2016-07-01 10:17:05         00:00:05 2016-07-01 10:17:05     0.0   0.0   2016-07-01 02:01:02  b@gmail.com   welcome 2016-07-01 02:01:03         00:00:01 2016-07-01 02:01:03     0.0   0.0   2016-07-01 14:45:04  b@gmail.com  subject3 2016-07-01 14:45:05         00:00:01 2016-07-01 14:45:05     0.0   0.0   2016-07-01 08:14:02  b@gmail.com  subject2 2016-07-01 08:15:48         00:01:46                 NaT     NaN   NaN   2016-07-01 16:26:35  b@gmail.com  subject4 2016-07-01 16:27:28         00:00:53                 NaT     NaN   NaN 

The result is wrong, mainly because the closest date is incorrect since it does not take into account email & subject.

The expected result is

Amending the function to give the closest timesstamps for a given email and subject would be helpful.

df1.groupby(['email','subject'])['timestamp'].apply(find_closest_date, args=[df1.timestamp]) 

But that gives an error as the function is not defined for a group object. What's the best way of doing this?

回答1:

Notice that if you merge df1 and df2 on email and subject, then the result has all the possible relevant timestamp pairings:

In [108]: result = pd.merge(df1, df2, how='left', on=['email','subject'], suffixes=['', '_y']); result Out[108]:               timestamp        email   subject         timestamp_y  clicks  var1 0  2016-07-01 10:17:00  a@gmail.com  subject3 2016-07-01 10:17:39       1     7 1  2016-07-01 10:17:00  a@gmail.com  subject3 2016-07-01 14:46:01       1     2 2  2016-07-01 02:01:02  a@gmail.com   welcome 2016-07-01 02:01:14       1     1 3  2016-07-01 14:45:04  a@gmail.com  subject3 2016-07-01 10:17:39       1     7 4  2016-07-01 14:45:04  a@gmail.com  subject3 2016-07-01 14:46:01       1     2 5  2016-07-01 08:14:02  a@gmail.com  subject2 2016-07-01 08:15:48       2     2 6  2016-07-01 16:26:35  a@gmail.com  subject4 2016-07-01 16:27:28       1     2 7  2016-07-01 10:17:00  b@gmail.com  subject3 2016-07-01 10:17:05       0     0 8  2016-07-01 10:17:00  b@gmail.com  subject3 2016-07-01 14:45:05       0     0 9  2016-07-01 02:01:02  b@gmail.com   welcome 2016-07-01 02:01:03       0     0 10 2016-07-01 14:45:04  b@gmail.com  subject3 2016-07-01 10:17:05       0     0 11 2016-07-01 14:45:04  b@gmail.com  subject3 2016-07-01 14:45:05       0     0 12 2016-07-01 08:14:02  b@gmail.com  subject2 2016-07-01 08:16:00       0     0 13 2016-07-01 16:26:35  b@gmail.com  subject4 2016-07-01 17:00:00       0     0 

You could now take the absolute value of the difference in timestamps for each row:

result['diff'] = (result['timestamp_y'] - result['timestamp']).abs() 

and then use

idx = result.groupby(['timestamp','email','subject'])['diff'].idxmin() result = result.loc[idx] 

to find the rows with the minimum difference for each group based on ['timestamp','email','subject'].


import numpy as np import pandas as pd from pandas.io.parsers import StringIO  a = """timestamp,email,subject 2016-07-01 10:17:00,a@gmail.com,subject3 2016-07-01 02:01:02,a@gmail.com,welcome 2016-07-01 14:45:04,a@gmail.com,subject3 2016-07-01 08:14:02,a@gmail.com,subject2 2016-07-01 16:26:35,a@gmail.com,subject4 2016-07-01 10:17:00,b@gmail.com,subject3 2016-07-01 02:01:02,b@gmail.com,welcome 2016-07-01 14:45:04,b@gmail.com,subject3 2016-07-01 08:14:02,b@gmail.com,subject2 2016-07-01 16:26:35,b@gmail.com,subject4 """  b = """timestamp,email,subject,clicks,var1 2016-07-01 02:01:14,a@gmail.com,welcome,1,1 2016-07-01 08:15:48,a@gmail.com,subject2,2,2 2016-07-01 10:17:39,a@gmail.com,subject3,1,7 2016-07-01 14:46:01,a@gmail.com,subject3,1,2 2016-07-01 16:27:28,a@gmail.com,subject4,1,2 2016-07-01 10:17:05,b@gmail.com,subject3,0,0 2016-07-01 02:01:03,b@gmail.com,welcome,0,0 2016-07-01 14:45:05,b@gmail.com,subject3,0,0 2016-07-01 08:16:00,b@gmail.com,subject2,0,0 2016-07-01 17:00:00,b@gmail.com,subject4,0,0 """  df1 = pd.read_csv(StringIO(a), parse_dates=['timestamp']) df2 = pd.read_csv(StringIO(b), parse_dates=['timestamp'])  result = pd.merge(df1, df2, how='left', on=['email','subject'], suffixes=['', '_y']) result['diff'] = (result['timestamp_y'] - result['timestamp']).abs() idx = result.groupby(['timestamp','email','subject'])['diff'].idxmin() result = result.loc[idx].drop(['timestamp_y','diff'], axis=1) result = result.sort_index() print(result) 

yields

             timestamp        email   subject  clicks  var1 0  2016-07-01 10:17:00  a@gmail.com  subject3       1     7 2  2016-07-01 02:01:02  a@gmail.com   welcome       1     1 4  2016-07-01 14:45:04  a@gmail.com  subject3       1     2 5  2016-07-01 08:14:02  a@gmail.com  subject2       2     2 6  2016-07-01 16:26:35  a@gmail.com  subject4       1     2 7  2016-07-01 10:17:00  b@gmail.com  subject3       0     0 9  2016-07-01 02:01:02  b@gmail.com   welcome       0     0 11 2016-07-01 14:45:04  b@gmail.com  subject3       0     0 12 2016-07-01 08:14:02  b@gmail.com  subject2       0     0 13 2016-07-01 16:26:35  b@gmail.com  subject4       0     0 


回答2:

You want to apply the closest timestamp logic to each group of 'email' and 'subject'

a = """timestamp,email,subject 2016-07-01 10:17:00,a@gmail.com,subject3 2016-07-01 02:01:02,a@gmail.com,welcome 2016-07-01 14:45:04,a@gmail.com,subject3 2016-07-01 08:14:02,a@gmail.com,subject2 2016-07-01 16:26:35,a@gmail.com,subject4 2016-07-01 10:17:00,b@gmail.com,subject3 2016-07-01 02:01:02,b@gmail.com,welcome 2016-07-01 14:45:04,b@gmail.com,subject3 2016-07-01 08:14:02,b@gmail.com,subject2 2016-07-01 16:26:35,b@gmail.com,subject4 """  b = """timestamp,email,subject,clicks,var1 2016-07-01 02:01:14,a@gmail.com,welcome,1,1 2016-07-01 08:15:48,a@gmail.com,subject2,2,2 2016-07-01 10:17:39,a@gmail.com,subject3,1,7 2016-07-01 14:46:01,a@gmail.com,subject3,1,2 2016-07-01 16:27:28,a@gmail.com,subject4,1,2 2016-07-01 10:17:05,b@gmail.com,subject3,0,0 2016-07-01 02:01:03,b@gmail.com,welcome,0,0 2016-07-01 14:45:05,b@gmail.com,subject3,0,0 2016-07-01 08:16:00,b@gmail.com,subject2,0,0 2016-07-01 17:00:00,b@gmail.com,subject4,0,0 """  df1 = pd.read_csv(StringIO(a), parse_dates=['timestamp']) df2 = pd.read_csv(StringIO(b), parse_dates=['timestamp']) df2 = df2.set_index(['email', 'subject'])  def find_closest_date(timepoint, time_series, add_time_delta_column=True):     # takes a pd.Timestamp() instance and a pd.Series with dates in it     # calcs the delta between `timepoint` and each date in `time_series`     # returns the closest date and optionally the number of days in its time delta     time_series = time_series.values     timepoint = np.datetime64(timepoint)     deltas = np.abs(np.subtract(time_series, timepoint))     idx_closest_date = np.argmin(deltas)     res = {"closest_date": time_series[idx_closest_date]}     idx = ['closest_date']     if add_time_delta_column:         res["closest_delta"] = deltas[idx_closest_date]         idx.append('closest_delta')     return pd.Series(res, index=idx)  # Then group df1 as needed grouped = df1.groupby(['email', 'subject'])  # Finally loop over the group items, finding the closest timestamps join_ts = pd.DataFrame() for name, group in grouped:     try:         join_ts = pd.concat([join_ts, group['timestamp']\                              .apply(find_closest_date, time_series=df2.loc[name, 'timestamp'])],                             axis=0)     except KeyError:         pass  df3 = pd.merge(pd.concat([df1, join_ts], axis=1), df2, left_on=['closest_date'], right_on=['timestamp']) 


易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!