merge dataframes with timestamps and intervals

雨燕双飞 提交于 2020-07-03 00:58:10

问题


I have two dataframes.

df1contains number and timestamps. It is a very large set.

df1.head()
Out[292]: 
2016-08-31 08:09:00     1.0
2016-08-31 08:11:00     7.0
2016-08-31 08:14:00    90.0

df2 contains intervals

d2.head()
Out[292]:        
   start                 stop                         C
2016-08-31 08:09:00     2016-08-31 08:12:00         'a'
2016-08-31 08:13:00     2016-08-31 08:20:00         'b'
2016-08-31 08:20:00     2016-08-31 08:45:00         'c'

I would like to add a new column C to df1 such that the value of C is corresponding to the value in df2 for the interval which contained the index of df1.

Expected result

 df1.head()
    Out[292]:                      C
    2016-08-31 08:09:00     1.0   'a'
    2016-08-31 08:11:00     7.0   'a'
    2016-08-31 08:14:00    90.0   'b'

So far, I have tried:

 df1.loc[:,'C']=df1.index.map(lambda i:df2[np.logical_and(i>df2.starti<df2.stop)].C)

but it is extremely inefficient and crashes for some cases where there are values of the index which are missing from the interval list in df2.

How to do this efficiently?


回答1:


Create IntervalIndex by IntervalIndex.from_arrays first:

s = pd.IntervalIndex.from_arrays(df2['start'], df2['stop'], 'both')
print (s)
IntervalIndex([[2016-08-31 08:09:00, 2016-08-31 08:12:00], 
               [2016-08-31 08:13:00, 2016-08-31 08:20:00],
               [2016-08-31 08:20:00, 2016-08-31 08:45:00]],
              closed='both',
              dtype='interval[datetime64[ns]]')

Then set_index by new IntervalIndex set to new column by array created by values:

df1['C'] = df2.set_index(s).loc[df1.index, 'C'].values
print (df1)
                        A  C
2016-08-31 08:09:00   1.0  a
2016-08-31 08:11:00   7.0  a
2016-08-31 08:14:00  90.0  b

EDIT:

s = pd.IntervalIndex.from_arrays(df2['start'].astype(np.int64), 
                                 df2['stop'].astype(np.int64), 'both')
print (s)
IntervalIndex([[1472630940000000000, 1472631120000000000], 
               [1472631180000000000, 1472631600000000000], 
               [1472631600000000000, 1472633100000000000]],
              closed='both',
              dtype='interval[int64]')

df1['C'] = df2.set_index(s).loc[df1.index.astype(np.int64), 'C'].values
print (df1)
                        A  C
2016-08-31 08:09:00   1.0  a
2016-08-31 08:11:00   7.0  a
2016-08-31 08:14:00  90.0  b


来源:https://stackoverflow.com/questions/54418055/merge-dataframes-with-timestamps-and-intervals

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!