How to merge two data frames based on nearest date

日久生厌 2021-01-05 19:22

I want to merge two data frames based on two columns: "Code" and "Date". It is straightforward to merge data frames based on "Code", however in case of "Date" it bec

2 answers
  •  心在旅途
    2021-01-05 19:57

    I don't think there's a quick, one-line way to do this kind of thing, but I believe the best approach is to do it this way:

    1. add a column to df1 with the closest date from the appropriate group in df2

    2. call a standard merge on these

    As the size of your data grows, this "closest date" operation can become rather expensive unless you do something sophisticated. I like to use scikit-learn's NearestNeighbors class for this sort of thing.
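    To make the "closest date" operation concrete, here is a minimal sketch that avoids scikit-learn and uses a binary search on sorted dates instead. The helper name `nearest_dates` is hypothetical and this is not the approach used in the rest of this answer; it only illustrates what step 1 has to compute for one group.

```python
import numpy as np
import pandas as pd

def nearest_dates(queries, candidates):
    """For each date in `queries`, return the closest date in `candidates`.

    Hypothetical helper for illustration; assumes `candidates` is non-empty.
    """
    cand = np.sort(np.asarray(candidates, dtype='datetime64[ns]'))
    q = np.asarray(queries, dtype='datetime64[ns]')
    # index at which each query would be inserted to keep `cand` sorted
    pos = np.searchsorted(cand, q)
    # neighbours on either side of that position, clipped to valid indices
    left = np.clip(pos - 1, 0, len(cand) - 1)
    right = np.clip(pos, 0, len(cand) - 1)
    # take the right neighbour only when it is strictly closer,
    # so ties go to the earlier date
    use_right = (cand[right] - q) < (q - cand[left])
    return np.where(use_right, cand[right], cand[left])

# usage: the Code == 1 dates from the example data in this answer
queries = pd.to_datetime(['2015-04-06', '2015-05-09', '2015-06-08'])
candidates = pd.to_datetime(['2015-02-03', '2015-04-13', '2015-06-26'])
closest = nearest_dates(queries, candidates)
# closest holds 2015-04-13, 2015-04-13, 2015-06-26
```

    Each lookup costs a binary search rather than a scan over all candidate dates, which is the kind of saving that matters once the frames get large.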

    I've put together one approach to that solution that should scale relatively well. First we can generate some simple data:

    import pandas as pd
    import numpy as np
    dates = pd.date_range('2015', periods=200, freq='D')
    
    rand = np.random.RandomState(42)
    i1 = np.sort(rand.permutation(np.arange(len(dates)))[:5])
    i2 = np.sort(rand.permutation(np.arange(len(dates)))[:5])
    
    df1 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
                        'Date': dates[i1],
                        'val1':rand.rand(5)})
    df2 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
                        'Date': dates[i2],
                        'val2':rand.rand(5)})
    

    Let's check these out:

    >>> df1
       Code       Date      val1
    0     0 2015-01-16  0.975852
    1     0 2015-01-31  0.516300
    2     1 2015-04-06  0.322956
    3     1 2015-05-09  0.795186
    4     1 2015-06-08  0.270832
    
    >>> df2
       Code       Date      val2
    0     1 2015-02-03  0.184334
    1     1 2015-04-13  0.080873
    2     0 2015-05-02  0.428314
    3     1 2015-06-26  0.688500
    4     0 2015-06-30  0.058194
    

    Now let's write an apply function that adds a column of nearest dates to df1 using scikit-learn:

    from sklearn.neighbors import NearestNeighbors
    
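    def find_nearest(group, match, groupname):
        # restrict df2 to the rows that share this group's key
        match = match[match[groupname] == group.name]
        # scikit-learn expects numeric input, so pass the dates as int64 nanoseconds
        match_dates = match['Date'].values.astype('int64')[:, None]
        group_dates = group['Date'].values.astype('int64')[:, None]
        nbrs = NearestNeighbors(n_neighbors=1).fit(match_dates)
        dist, ind = nbrs.kneighbors(group_dates)
    
        group = group.copy()  # avoid modifying the slice handed in by groupby
        group['Date1'] = group['Date']  # keep the original date
        group['Date'] = match['Date'].values[ind.ravel()]  # nearest date in df2
        return group
    
    # group_keys=False keeps the original index, so 'Code' stays a plain column
    df1_mod = df1.groupby('Code', group_keys=False).apply(find_nearest, df2, 'Code')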
    >>> df1_mod
       Code       Date      val1      Date1
    0     0 2015-05-02  0.975852 2015-01-16
    1     0 2015-05-02  0.516300 2015-01-31
    2     1 2015-04-13  0.322956 2015-04-06
    3     1 2015-04-13  0.795186 2015-05-09
    4     1 2015-06-26  0.270832 2015-06-08
    

    Finally, we can merge these together with a straightforward call to pd.merge:

    >>> pd.merge(df1_mod, df2, on=['Code', 'Date'])
       Code       Date      val1      Date1      val2
    0     0 2015-05-02  0.975852 2015-01-16  0.428314
    1     0 2015-05-02  0.516300 2015-01-31  0.428314
    2     1 2015-04-13  0.322956 2015-04-06  0.080873
    3     1 2015-04-13  0.795186 2015-05-09  0.080873
    4     1 2015-06-26  0.270832 2015-06-08  0.688500
    

    Notice that rows 0 and 1 both matched the same val2; this is expected given the way you described your desired solution.
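    As a possible alternative in newer pandas versions, `pd.merge_asof` can do a nearest-key join within groups. The sketch below assumes a pandas release in which `merge_asof` accepts `direction='nearest'`, and it rebuilds the example frames by hand from the values printed above. The `Date2` column is a name introduced here only so that the matched date from df2 survives the merge.

```python
import pandas as pd

# the example frames from above, typed in by hand
df1 = pd.DataFrame({
    'Code': [0, 0, 1, 1, 1],
    'Date': pd.to_datetime(['2015-01-16', '2015-01-31', '2015-04-06',
                            '2015-05-09', '2015-06-08']),
    'val1': [0.975852, 0.516300, 0.322956, 0.795186, 0.270832]})
df2 = pd.DataFrame({
    'Code': [1, 1, 0, 1, 0],
    'Date': pd.to_datetime(['2015-02-03', '2015-04-13', '2015-05-02',
                            '2015-06-26', '2015-06-30']),
    'val2': [0.184334, 0.080873, 0.428314, 0.688500, 0.058194]})

# merge_asof requires both frames to be sorted by the `on` key;
# Date2 (hypothetical name) preserves the matched date from df2
left = df1.sort_values('Date')
right = df2.assign(Date2=df2['Date']).sort_values('Date')

# for each row of `left`, take the row of `right` with the same Code
# whose Date is nearest in either direction
merged = pd.merge_asof(left, right, on='Date', by='Code', direction='nearest')
print(merged)
```

    On this data the `val2` column comes out the same as in the scikit-learn version above, with `Date` holding the original df1 date and `Date2` the matched df2 date.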
