Question
Let's say I have a pandas DataFrame as below:
>>> df=pd.DataFrame({'dt':pd.to_datetime(['2018-12-10 16:35:34.246','2018-12-10 16:36:34.243','2018-12-10 16:38:34.216','2018-12-10 16:42:34.123']),'value':[1,2,3,4]})
>>> df
dt value
0 2018-12-10 16:35:34.246 1
1 2018-12-10 16:36:34.243 2
2 2018-12-10 16:38:34.216 3
3 2018-12-10 16:42:34.123 4
>>>
I would like to group this DataFrame by the 'dt' column, but in a way that treats values less than one second apart as the same. After grouping, I would like to sum the 'value' column within each group, and I want the DataFrame to remain the same length, so rows whose timestamps are less than one second apart would all carry the same summed value. So far I have tried:
>>> df.groupby('dt',as_index=False)['value'].sum()
dt value
0 2018-12-10 16:35:34.246 1
1 2018-12-10 16:36:34.243 2
2 2018-12-10 16:38:34.216 3
3 2018-12-10 16:42:34.123 4
>>>
But as you can see, the DataFrame didn't change, because this groups only by exactly equal 'dt' column values.
My desired output is:
dt value
0 2018-12-10 16:35:34.246 3
1 2018-12-10 16:36:34.243 3
2 2018-12-10 16:38:34.216 3
3 2018-12-10 16:42:34.123 4
Answer 1:
A brute-force solution is to take the absolute difference between your datetime series and each datetime value, then compare it against a threshold:
# data from @StephenCowley
threshold = pd.Timedelta(seconds=1)
df['val'] = [df.loc[(df['dt'] - t).abs() < threshold, 'value'].sum()
             for t in df['dt']]
print(df)
dt value val
0 2018-12-10 16:35:34.246 1 3
1 2018-12-10 16:35:34.243 2 3
2 2018-12-10 16:38:34.216 3 3
3 2018-12-10 16:42:34.123 4 4
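The same pairwise comparison can also be written without a Python-level loop. The following is only a sketch using NumPy broadcasting, not part of the original answer; it assumes the data in which the first two timestamps share the same minute, and it builds an n x n matrix, so memory use grows quadratically with the number of rows.

```python
import numpy as np
import pandas as pd

# Assumed sample data: the first two timestamps are 3 ms apart.
df = pd.DataFrame({
    'dt': pd.to_datetime(['2018-12-10 16:35:34.246',
                          '2018-12-10 16:35:34.243',
                          '2018-12-10 16:38:34.216',
                          '2018-12-10 16:42:34.123']),
    'value': [1, 2, 3, 4],
})

threshold = np.timedelta64(1, 's')
dt = df['dt'].to_numpy()        # datetime64 array
values = df['value'].to_numpy()

# close[i, j] is True when rows i and j are less than one second apart.
close = np.abs(dt[:, None] - dt[None, :]) < threshold

# For each row i, sum the values of every row j that is close to it.
df['val'] = (close * values[None, :]).sum(axis=1)
print(df)
```

For small frames the list comprehension above is probably easier to read; the broadcast form mainly helps when the per-row Python overhead becomes noticeable.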
Answer 2:
(Assuming you meant the first two to have the same minute value.)
I'm not sure how to do it with groupby, but here is something that gives the same results:
df=pd.DataFrame({'dt':pd.to_datetime(['2018-12-10 16:35:34.246',
                                      '2018-12-10 16:35:34.243',
                                      '2018-12-10 16:38:34.216',
                                      '2018-12-10 16:42:34.123']),
                 'value':[1,2,3,4]})
# For each timestamp t, select the rows whose 'dt' is
# later than t minus one second and earlier than t plus one second,
# then take their 'value' column and sum it
df['val'] = [df[(df.dt>t-pd.Timedelta(seconds=1))&
                (df.dt<t+pd.Timedelta(seconds=1))]['value'].sum()
             for t in df.dt]
dt value val
0 2018-12-10 16:35:34.246 1 3
1 2018-12-10 16:35:34.243 2 3
2 2018-12-10 16:38:34.216 3 3
3 2018-12-10 16:42:34.123 4 4
As a side note, I looked into doing this same sort of thing with groupby, but I couldn't figure out how to get it to work. You can pass a function to the groupby method. If you choose to go that route, note that the function is called on the index labels of the DataFrame. That makes me think it would be hard to use groupby here, since as far as I know one row cannot belong to multiple groups...
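If it is acceptable for each row to belong to exactly one group, one possible groupby route is to sort by 'dt', start a new group wherever the gap to the previous timestamp reaches one second, and broadcast each group's sum back with transform. This is a sketch under that assumption rather than an equivalent of the loop above: it chains neighbours, so three timestamps spaced 0.6 s apart would land in a single group even though the first and last are more than a second apart. The names ordered and group_id are illustrative.

```python
import pandas as pd

# Assumed sample data: the first two timestamps are 3 ms apart.
df = pd.DataFrame({
    'dt': pd.to_datetime(['2018-12-10 16:35:34.246',
                          '2018-12-10 16:35:34.243',
                          '2018-12-10 16:38:34.216',
                          '2018-12-10 16:42:34.123']),
    'value': [1, 2, 3, 4],
})

threshold = pd.Timedelta(seconds=1)
ordered = df.sort_values('dt')

# A new group starts wherever the gap to the previous row is >= threshold.
# The first gap is NaT, which compares as False, so the first row gets id 0.
group_id = (ordered['dt'].diff() >= threshold).cumsum()

# transform('sum') returns one value per row; assignment aligns on the index,
# so the original row order and length of df are preserved.
df['val'] = ordered.groupby(group_id)['value'].transform('sum')
print(df)
```

On the sample data this gives the same 'val' column as the loop, because no chain of close timestamps spans more than one second.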
Source: https://stackoverflow.com/questions/53700854/group-by-a-dataframe-by-values-that-are-just-less-than-a-second-off-pandas