pandas merge dataframes on closest timestamp

前端 未结 2 1141
日久生厌
日久生厌 2021-01-13 22:44

I want to merge two dataframes on three columns: email, subject and timestamp. The timestamps between the dataframes differ and I therefore need to identify the closest mat

2条回答
  •  猫巷女王i
    2021-01-13 23:32

    Notice that if you merge df1 and df2 on email and subject, then the result has all the possible relevant timestamp pairings:

    In [108]: result = pd.merge(df1, df2, how='left', on=['email','subject'], suffixes=['', '_y']); result
    Out[108]: 
                 timestamp        email   subject         timestamp_y  clicks  var1
    0  2016-07-01 10:17:00  a@gmail.com  subject3 2016-07-01 10:17:39       1     7
    1  2016-07-01 10:17:00  a@gmail.com  subject3 2016-07-01 14:46:01       1     2
    2  2016-07-01 02:01:02  a@gmail.com   welcome 2016-07-01 02:01:14       1     1
    3  2016-07-01 14:45:04  a@gmail.com  subject3 2016-07-01 10:17:39       1     7
    4  2016-07-01 14:45:04  a@gmail.com  subject3 2016-07-01 14:46:01       1     2
    5  2016-07-01 08:14:02  a@gmail.com  subject2 2016-07-01 08:15:48       2     2
    6  2016-07-01 16:26:35  a@gmail.com  subject4 2016-07-01 16:27:28       1     2
    7  2016-07-01 10:17:00  b@gmail.com  subject3 2016-07-01 10:17:05       0     0
    8  2016-07-01 10:17:00  b@gmail.com  subject3 2016-07-01 14:45:05       0     0
    9  2016-07-01 02:01:02  b@gmail.com   welcome 2016-07-01 02:01:03       0     0
    10 2016-07-01 14:45:04  b@gmail.com  subject3 2016-07-01 10:17:05       0     0
    11 2016-07-01 14:45:04  b@gmail.com  subject3 2016-07-01 14:45:05       0     0
    12 2016-07-01 08:14:02  b@gmail.com  subject2 2016-07-01 08:16:00       0     0
    13 2016-07-01 16:26:35  b@gmail.com  subject4 2016-07-01 17:00:00       0     0
    

    You could now take the absolute value of the difference in timestamps for each row:

    result['diff'] = (result['timestamp_y'] - result['timestamp']).abs()
    

    and then use

    idx = result.groupby(['timestamp','email','subject'])['diff'].idxmin()
    result = result.loc[idx]
    

    to find the rows with the minimum difference for each group based on ['timestamp','email','subject'].


    import numpy as np
    import pandas as pd
    from pandas.io.parsers import StringIO
    
    a = """timestamp,email,subject
    2016-07-01 10:17:00,a@gmail.com,subject3
    2016-07-01 02:01:02,a@gmail.com,welcome
    2016-07-01 14:45:04,a@gmail.com,subject3
    2016-07-01 08:14:02,a@gmail.com,subject2
    2016-07-01 16:26:35,a@gmail.com,subject4
    2016-07-01 10:17:00,b@gmail.com,subject3
    2016-07-01 02:01:02,b@gmail.com,welcome
    2016-07-01 14:45:04,b@gmail.com,subject3
    2016-07-01 08:14:02,b@gmail.com,subject2
    2016-07-01 16:26:35,b@gmail.com,subject4
    """
    
    b = """timestamp,email,subject,clicks,var1
    2016-07-01 02:01:14,a@gmail.com,welcome,1,1
    2016-07-01 08:15:48,a@gmail.com,subject2,2,2
    2016-07-01 10:17:39,a@gmail.com,subject3,1,7
    2016-07-01 14:46:01,a@gmail.com,subject3,1,2
    2016-07-01 16:27:28,a@gmail.com,subject4,1,2
    2016-07-01 10:17:05,b@gmail.com,subject3,0,0
    2016-07-01 02:01:03,b@gmail.com,welcome,0,0
    2016-07-01 14:45:05,b@gmail.com,subject3,0,0
    2016-07-01 08:16:00,b@gmail.com,subject2,0,0
    2016-07-01 17:00:00,b@gmail.com,subject4,0,0
    """
    
    df1 = pd.read_csv(StringIO(a), parse_dates=['timestamp'])
    df2 = pd.read_csv(StringIO(b), parse_dates=['timestamp'])
    
    result = pd.merge(df1, df2, how='left', on=['email','subject'], suffixes=['', '_y'])
    result['diff'] = (result['timestamp_y'] - result['timestamp']).abs()
    idx = result.groupby(['timestamp','email','subject'])['diff'].idxmin()
    result = result.loc[idx].drop(['timestamp_y','diff'], axis=1)
    result = result.sort_index()
    print(result)
    

    yields

                 timestamp        email   subject  clicks  var1
    0  2016-07-01 10:17:00  a@gmail.com  subject3       1     7
    2  2016-07-01 02:01:02  a@gmail.com   welcome       1     1
    4  2016-07-01 14:45:04  a@gmail.com  subject3       1     2
    5  2016-07-01 08:14:02  a@gmail.com  subject2       2     2
    6  2016-07-01 16:26:35  a@gmail.com  subject4       1     2
    7  2016-07-01 10:17:00  b@gmail.com  subject3       0     0
    9  2016-07-01 02:01:02  b@gmail.com   welcome       0     0
    11 2016-07-01 14:45:04  b@gmail.com  subject3       0     0
    12 2016-07-01 08:14:02  b@gmail.com  subject2       0     0
    13 2016-07-01 16:26:35  b@gmail.com  subject4       0     0
    

提交回复
热议问题