group by a dataframe by values that are just less than a second off - pandas

Submitted on 2021-01-28 18:30:47

Question


Let's say I have a pandas dataframe as below:

>>> df=pd.DataFrame({'dt':pd.to_datetime(['2018-12-10 16:35:34.246','2018-12-10 16:36:34.243','2018-12-10 16:38:34.216','2018-12-10 16:42:34.123']),'value':[1,2,3,4]})
>>> df
                       dt  value
0 2018-12-10 16:35:34.246      1
1 2018-12-10 16:36:34.243      2
2 2018-12-10 16:38:34.216      3
3 2018-12-10 16:42:34.123      4
>>> 

I would like to group this dataframe by the 'dt' column, but in a way that treats values less than a second apart as the same. After grouping, I'd like to sum the 'value' column within each group, while keeping the dataframe the same length, so every row in a "less than one second apart" cluster carries the same summed value. So far I tried:

>>> df.groupby('dt',as_index=False)['value'].sum()
                       dt  value
0 2018-12-10 16:35:34.246      1
1 2018-12-10 16:36:34.243      2
2 2018-12-10 16:38:34.216      3
3 2018-12-10 16:42:34.123      4
>>> 

But as you can see, the dataframe didn't change, because this groups only by identical 'dt' values.

My desired output is:

                       dt  value
0 2018-12-10 16:35:34.246      3
1 2018-12-10 16:36:34.243      3
2 2018-12-10 16:38:34.216      3
3 2018-12-10 16:42:34.123      4

Answer 1:


A brute force solution is to take the absolute difference between your datetime series and each datetime value, then compare against a threshold:

# data from @StephenCowley (the modified example in Answer 2,
# where the first two timestamps share the same minute)

threshold = pd.Timedelta(seconds=1)

df['val'] = [df.loc[(df['dt'] - t).abs() < threshold, 'value'].sum()
             for t in df['dt']]

print(df)

                       dt  value  val
0 2018-12-10 16:35:34.246      1    3
1 2018-12-10 16:35:34.243      2    3
2 2018-12-10 16:38:34.216      3    3
3 2018-12-10 16:42:34.123      4    4



Answer 2:


(Assuming you meant the first two to have the same minute value.)

I'm not sure how to do it with groupby, but here is something with the same results:

df = pd.DataFrame({'dt': pd.to_datetime(['2018-12-10 16:35:34.246',
                                         '2018-12-10 16:35:34.243',
                                         '2018-12-10 16:38:34.216',
                                         '2018-12-10 16:42:34.123']),
                   'value': [1, 2, 3, 4]})

# For each timestamp, select the rows less than a second earlier
# and less than a second later, then sum their 'value' column.
df['val'] = [df[(df.dt > t - pd.Timedelta(seconds=1)) &
                (df.dt < t + pd.Timedelta(seconds=1))]['value'].sum()
             for t in df.dt]

                       dt  value  val
0 2018-12-10 16:35:34.246      1    3
1 2018-12-10 16:35:34.243      2    3
2 2018-12-10 16:38:34.216      3    3
3 2018-12-10 16:42:34.123      4    4

As a side note, I looked into doing this same sort of thing with groupby but couldn't get it to work. You can pass a function into the groupby method; if you go that route, note that the function receives indices of the DataFrame. That makes me think groupby would be hard to use here, since I don't know that one row can belong to multiple groups.
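That said, if you are willing to assume the clusters are well separated (each pair of rows is either within a second or more than a second apart, as in the example data), a groupby-based sketch is possible: sort by 'dt', start a new group wherever the gap to the previous timestamp is at least one second, and use `transform('sum')` to keep the original length. The `ordered` and `group_id` names below are illustrative, not from the original answers.

```python
import pandas as pd

df = pd.DataFrame({'dt': pd.to_datetime(['2018-12-10 16:35:34.246',
                                         '2018-12-10 16:35:34.243',
                                         '2018-12-10 16:38:34.216',
                                         '2018-12-10 16:42:34.123']),
                   'value': [1, 2, 3, 4]})

# Sort by time so near-identical timestamps become neighbours.
ordered = df.sort_values('dt')

# Start a new group wherever the gap to the previous row is >= 1 second;
# cumsum turns those break points into consecutive group labels.
group_id = (ordered['dt'].diff() >= pd.Timedelta(seconds=1)).cumsum()

# transform('sum') keeps one row per input row; index alignment puts
# the group sums back in the original row order.
df['val'] = ordered.groupby(group_id)['value'].transform('sum')

print(df)
```

Note the semantics differ slightly from the per-row windows above: grouping by gaps chains transitively, so a row joins its neighbour's group even if the whole cluster spans more than a second end to end.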



Source: https://stackoverflow.com/questions/53700854/group-by-a-dataframe-by-values-that-are-just-less-than-a-second-off-pandas
