Question
This is a somewhat broad topic, but I will try to pare it down to some specific questions.
I have noticed a difference between resample and groupby that I am curious to learn about. Here is some hourly time series data:
In[]:
import pandas as pd
dr = pd.date_range('01-01-2020 8:00', periods=10, freq='H')
df = pd.DataFrame({'A': range(10),
                   'B': range(10, 20),
                   'C': range(20, 30)}, index=dr)
df
Out[]:
A B C
2020-01-01 08:00:00 0 10 20
2020-01-01 09:00:00 1 11 21
2020-01-01 10:00:00 2 12 22
2020-01-01 11:00:00 3 13 23
2020-01-01 12:00:00 4 14 24
2020-01-01 13:00:00 5 15 25
2020-01-01 14:00:00 6 16 26
2020-01-01 15:00:00 7 17 27
2020-01-01 16:00:00 8 18 28
2020-01-01 17:00:00 9 19 29
I can downsample the data using either groupby with a freq pandas.Grouper or resample (which seems the more typical thing to do):
g = df.groupby(pd.Grouper(freq='2H'))
r = df.resample(rule='2H')
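As a quick check of the idea that both objects bin the data the same way, here is a minimal sketch comparing a plain aggregation (sum) on each. It rebuilds the same frame; passing the frequency as a pd.Timedelta instead of the 'H' alias is a choice of this sketch, made only to sidestep differences in alias spelling between pandas versions.

```python
import pandas as pd

# Same data as above; the Timedelta frequency is an assumption of this sketch.
dr = pd.date_range('2020-01-01 08:00', periods=10, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({'A': range(10),
                   'B': range(10, 20),
                   'C': range(20, 30)}, index=dr)

two_hours = pd.Timedelta(hours=2)
g_sum = df.groupby(pd.Grouper(freq=two_hours)).sum()
r_sum = df.resample(rule=two_hours).sum()

# Compare the bin labels and the aggregated values separately.
same_index = list(g_sum.index) == list(r_sum.index)
same_values = bool((g_sum.to_numpy() == r_sum.to_numpy()).all())
print(same_index, same_values)
```

For a simple reducer like sum, both comparisons should come out True on this data, which is consistent with the two approaches producing the same bins.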
My impression was that these two were essentially the same thing (correct me if I am wrong, but resample seems to be a rebranded groupby). However, I have found that when using the apply method of each object, you can index specific columns in the "DataFrameGroupBy" object g but not in the "Resampler" object r:
def foo(d):
    return d['A'] - d['B'] + 2*d['C']
In[]:
g.apply(foo)
Out[]:
2020-01-01 08:00:00 2020-01-01 08:00:00 30
2020-01-01 09:00:00 32
2020-01-01 10:00:00 2020-01-01 10:00:00 34
2020-01-01 11:00:00 36
2020-01-01 12:00:00 2020-01-01 12:00:00 38
2020-01-01 13:00:00 40
2020-01-01 14:00:00 2020-01-01 14:00:00 42
2020-01-01 15:00:00 44
2020-01-01 16:00:00 2020-01-01 16:00:00 46
2020-01-01 17:00:00 48
dtype: int64
In[]:
r.apply(foo)
Out[]:
#long multi-Exception error stack ending in:
KeyError: 'A'
It looks like the data d that the apply "sees" is different in each case, as shown by:
def bar(d):
    print(d)
In[]:
g.apply(bar)
Out[]:
A B C
2020-01-01 08:00:00 0 10 20
2020-01-01 09:00:00 1 11 21
... #more DataFrames corresponding to each bin
In[]:
r.apply(bar)
Out[]:
2020-01-01 08:00:00 0
2020-01-01 09:00:00 1
Name: A, dtype: int64
2020-01-01 10:00:00 2
2020-01-01 11:00:00 3
Name: A, dtype: int64
... #more Series, first the bins for column "A", then "B", then "C"
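To illustrate the observation that the function handed to r.apply receives one column of one bin at a time, here is a minimal sketch with a function that only needs a single Series. The helper name spread is hypothetical, and the Timedelta frequency is again an assumption made to avoid alias differences between pandas versions.

```python
import pandas as pd

dr = pd.date_range('2020-01-01 08:00', periods=10, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({'A': range(10),
                   'B': range(10, 20),
                   'C': range(20, 30)}, index=dr)
r = df.resample(rule=pd.Timedelta(hours=2))

def spread(s):
    # Hypothetical helper: reduces one column of one bin to max minus min.
    return s.max() - s.min()

# One value per bin and per column, since each call only sees one column.
per_column = r.apply(spread)
print(per_column)
```

Because spread never indexes by column name, it does not hit the KeyError that foo triggers.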
However, if you simply iterate over the Resampler object, you get the bins as DataFrames, which seems similar to groupby:
In[]:
for i, d in r:
    print(d)
Out[]:
A B C
2020-01-01 08:00:00 0 10 20
2020-01-01 09:00:00 1 11 21
A B C
2020-01-01 10:00:00 2 12 22
2020-01-01 11:00:00 3 13 23
A B C
2020-01-01 12:00:00 4 14 24
2020-01-01 13:00:00 5 15 25
A B C
2020-01-01 14:00:00 6 16 26
2020-01-01 15:00:00 7 17 27
A B C
2020-01-01 16:00:00 8 18 28
2020-01-01 17:00:00 9 19 29
The printout is the same when iterating over the DataFrameGroupBy object.
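The claim that both objects yield the same bins on iteration can be checked programmatically rather than by comparing printouts. This is a minimal sketch under the same assumptions as before (same data, Timedelta frequency chosen for this sketch):

```python
import pandas as pd

dr = pd.date_range('2020-01-01 08:00', periods=10, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({'A': range(10),
                   'B': range(10, 20),
                   'C': range(20, 30)}, index=dr)

two_hours = pd.Timedelta(hours=2)
g_bins = list(df.groupby(pd.Grouper(freq=two_hours)))
r_bins = list(df.resample(rule=two_hours))

# Each element is a (bin label, DataFrame) pair; compare them pairwise.
bins_match = len(g_bins) == len(r_bins) and all(
    kg == kr and dg.equals(dr_)
    for (kg, dg), (kr, dr_) in zip(g_bins, r_bins)
)
print(len(r_bins), bins_match)
```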
My questions based on the above:
- Can you access specific columns using resample and apply? I thought I had code where I did this, but now I think I am mistaken.
- Why does resample apply work on a Series for each column of each bin, instead of on a DataFrame for each bin?
Any general comments about what is going on here, or whether this pattern should be encouraged or discouraged, would also be appreciated. Thanks!
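For reference, one possible way to run a multi-column function per bin without going through Resampler.apply is to lean on the iteration behaviour described above, where each bin arrives as a DataFrame. This is only a sketch of that idea, not a statement about the recommended pandas pattern; the Timedelta frequency is an assumption of the sketch.

```python
import pandas as pd

dr = pd.date_range('2020-01-01 08:00', periods=10, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({'A': range(10),
                   'B': range(10, 20),
                   'C': range(20, 30)}, index=dr)
r = df.resample(rule=pd.Timedelta(hours=2))

def foo(d):
    return d['A'] - d['B'] + 2*d['C']

# Iterating yields (bin label, DataFrame) pairs, so all columns are available.
result = pd.concat([foo(d) for _, d in r])
print(result)
```

On this data the values match those from g.apply(foo) above, though the result keeps only the original timestamps as its index rather than adding a bin level.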
Source: https://stackoverflow.com/questions/62902115/what-is-the-difference-between-bins-when-using-groupby-apply-vs-resample-apply