group by a dataframe by values that are just less than a second off - pandas

Submitted on 2021-01-28 18:30:47

Question


Let's say I have a pandas dataframe as below:

>>> df=pd.DataFrame({'dt':pd.to_datetime(['2018-12-10 16:35:34.246','2018-12-10 16:36:34.243','2018-12-10 16:38:34.216','2018-12-10 16:42:34.123']),'value':[1,2,3,4]})
>>> df
                       dt  value
0 2018-12-10 16:35:34.246      1
1 2018-12-10 16:36:34.243      2
2 2018-12-10 16:38:34.216      3
3 2018-12-10 16:42:34.123      4
>>> 

I would like to group this dataframe by the 'dt' column, but in a way that treats values less than a second apart as the same. After grouping, I'd like to sum the 'value' column within each group, while keeping the dataframe the same length, so every row in a "less than one second apart" cluster carries the same summed value. So far I tried:

>>> df.groupby('dt',as_index=False)['value'].sum()
                       dt  value
0 2018-12-10 16:35:34.246      1
1 2018-12-10 16:36:34.243      2
2 2018-12-10 16:38:34.216      3
3 2018-12-10 16:42:34.123      4
>>> 

But as you can see, the dataframe didn't change, because this groups only by identical 'dt' values.

My desired output is:

                       dt  value
0 2018-12-10 16:35:34.246      3
1 2018-12-10 16:36:34.243      3
2 2018-12-10 16:38:34.216      3
3 2018-12-10 16:42:34.123      4

Answer 1:


A brute force solution is to take the absolute difference between your datetime series and each datetime value, then compare against a threshold:

# data from @StephenCowley (the modified example in Answer 2,
# where the first two timestamps share the same minute)

threshold = pd.Timedelta(seconds=1)

df['val'] = [df.loc[(df['dt'] - t).abs() < threshold, 'value'].sum()
             for t in df['dt']]

print(df)

                       dt  value  val
0 2018-12-10 16:35:34.246      1    3
1 2018-12-10 16:35:34.243      2    3
2 2018-12-10 16:38:34.216      3    3
3 2018-12-10 16:42:34.123      4    4



Answer 2:


(Assuming you meant the first two to have the same minute value.)

I'm not sure how to do it with groupby, but here is something with the same results:

df = pd.DataFrame({'dt': pd.to_datetime(['2018-12-10 16:35:34.246',
                                         '2018-12-10 16:35:34.243',
                                         '2018-12-10 16:38:34.216',
                                         '2018-12-10 16:42:34.123']),
                   'value': [1, 2, 3, 4]})

# For each timestamp, select the rows less than a second earlier
# and less than a second later, then sum their 'value' column.
df['val'] = [df[(df.dt > t - pd.Timedelta(seconds=1)) &
                (df.dt < t + pd.Timedelta(seconds=1))]['value'].sum()
             for t in df.dt]

                       dt  value  val
0 2018-12-10 16:35:34.246      1    3
1 2018-12-10 16:35:34.243      2    3
2 2018-12-10 16:38:34.216      3    3
3 2018-12-10 16:42:34.123      4    4

As a side note, I looked into doing this same sort of thing with groupby but couldn't get it to work. You can pass a function into the groupby method; if you go that route, note that the function receives indices of the DataFrame. That makes me think groupby would be hard to use here, since I don't know that one row can belong to multiple groups.
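That said, if you are willing to assume the clusters are well separated (each pair of rows is either within a second or more than a second apart, as in the example data), a groupby-based sketch is possible: sort by 'dt', start a new group wherever the gap to the previous timestamp is at least one second, and use `transform('sum')` to keep the original length. The `ordered` and `group_id` names below are illustrative, not from the original answers.

```python
import pandas as pd

df = pd.DataFrame({'dt': pd.to_datetime(['2018-12-10 16:35:34.246',
                                         '2018-12-10 16:35:34.243',
                                         '2018-12-10 16:38:34.216',
                                         '2018-12-10 16:42:34.123']),
                   'value': [1, 2, 3, 4]})

# Sort by time so near-identical timestamps become neighbours.
ordered = df.sort_values('dt')

# Start a new group wherever the gap to the previous row is >= 1 second;
# cumsum turns those break points into consecutive group labels.
group_id = (ordered['dt'].diff() >= pd.Timedelta(seconds=1)).cumsum()

# transform('sum') keeps one row per input row; index alignment puts
# the group sums back in the original row order.
df['val'] = ordered.groupby(group_id)['value'].transform('sum')

print(df)
```

Note the semantics differ slightly from the per-row windows above: grouping by gaps chains transitively, so a row joins its neighbour's group even if the whole cluster spans more than a second end to end.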



Source: https://stackoverflow.com/questions/53700854/group-by-a-dataframe-by-values-that-are-just-less-than-a-second-off-pandas
