Question
I have a dataset that I want to enrich. I need to compute a new column that is a function of the previous N rows of another column.
For example, I want to compute a binary column that shows whether the current day's temperature is higher than the average over the previous N days.
At the moment I iterate over every row with df.iterrows() and do the calculation row by row, which is slow. Is there a better option?
Answer 1:
Use rolling (moving-window) functions.
Sample DataFrame:
In [46]: df = pd.DataFrame({'date':pd.date_range('2000-01-01', freq='D', periods=15), 'temp':np.random.rand(15)*20})
In [47]: df
Out[47]:
         date       temp
0  2000-01-01  17.246616
1  2000-01-02  18.228468
2  2000-01-03   6.245991
3  2000-01-04   8.890069
4  2000-01-05   6.837285
5  2000-01-06   1.555924
6  2000-01-07  18.641918
7  2000-01-08   6.308174
8  2000-01-09  13.601203
9  2000-01-10   6.482098
10 2000-01-11  15.711497
11 2000-01-12  18.690925
12 2000-01-13   2.493110
13 2000-01-14  17.626622
14 2000-01-15   6.982129
Solution:
In [48]: df['higher_3avg'] = df.rolling(3)['temp'].mean().diff().gt(0)
In [49]: df
Out[49]:
         date       temp  higher_3avg
0  2000-01-01  17.246616        False
1  2000-01-02  18.228468        False
2  2000-01-03   6.245991        False
3  2000-01-04   8.890069        False
4  2000-01-05   6.837285        False
5  2000-01-06   1.555924        False
6  2000-01-07  18.641918         True
7  2000-01-08   6.308174        False
8  2000-01-09  13.601203         True
9  2000-01-10   6.482098        False
10 2000-01-11  15.711497         True
11 2000-01-12  18.690925         True
12 2000-01-13   2.493110        False
13 2000-01-14  17.626622         True
14 2000-01-15   6.982129        False
Explanation:
In [50]: df.rolling(3)['temp'].mean()
Out[50]:
0           NaN
1           NaN
2     13.907025
3     11.121509
4      7.324448
5      5.761093
6      9.011709
7      8.835339
8     12.850431
9      8.797158
10    11.931599
11    13.628173
12    12.298511
13    12.936886
14     9.033954
Name: temp, dtype: float64
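Note that rolling(3).mean().diff().gt(0) flags the rows where the 3-day average (current day included) went up, which for a window of 3 works out to temp[t] > temp[t-3]. If the literal requirement from the question is wanted (the current value compared with the mean of the previous N values, current day excluded), one possible sketch is to shift the rolling mean down by one row. The helper name higher_than_prev_avg and the sample values below are made up for illustration and are not part of the original answer.

```python
import pandas as pd

def higher_than_prev_avg(s: pd.Series, n: int) -> pd.Series:
    """True where s[t] > mean(s[t-n], ..., s[t-1]); hypothetical helper."""
    # rolling(n).mean() at row t covers rows t-n+1..t; shift(1) moves it
    # down one row, so row t sees the mean of rows t-n..t-1.
    prev_avg = s.rolling(n).mean().shift(1)
    # Comparisons against NaN are False, so the first n rows come out False.
    return s > prev_avg

temps = pd.Series([10.0, 12.0, 14.0, 11.0, 20.0, 5.0])  # made-up data
flags = higher_than_prev_avg(temps, n=3)
print(flags.tolist())
```

On the 15-row sample above, both expressions happen to give the same flags, but they can differ on other data.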
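Answer 2:
For large data, a NumPy solution can be roughly 30x faster (see the timings below). The function below is borrowed from another post:
def moving_average(a, n=3):
    # Running total: ret[i] = a[0] + ... + a[i]
    ret = a.cumsum()
    # Subtract the total up to i-n, leaving the sum of the last n values
    ret[n:] = ret[n:] - ret[:-n]
    # The first full window ends at index n-1
    return ret[n - 1:] / n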
In [419]: %timeit moving_average(df.values)
38.2 µs ± 1.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [420]: %timeit df.rolling(3).mean()
1.42 ms ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
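To apply the cumulative-sum approach to the original task, it helps to pass a single column as a 1-D float array and to keep in mind that the result is n - 1 elements shorter than the input. The sketch below uses made-up random data and a hypothetical helper, prev_avg_flag; it checks the function against rolling(n).mean() and then builds the flag column. Treat it as an illustration under those assumptions rather than a tuned implementation.

```python
import numpy as np
import pandas as pd

def moving_average(a, n=3):
    # Cumulative-sum moving average over a 1-D array.
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n

def prev_avg_flag(values, n=3):
    """flag[t] is True where values[t] > mean(values[t-n:t]); hypothetical helper."""
    values = np.asarray(values, dtype=float)
    flag = np.zeros(len(values), dtype=bool)
    if len(values) > n:
        # ma[i] is the mean of values[i:i+n], so ma[t-n] is the mean of the
        # n values before position t; the last window has no row after it.
        ma = moving_average(values, n)
        flag[n:] = values[n:] > ma[:-1]
    return flag

rng = np.random.default_rng(0)
df = pd.DataFrame({"temp": rng.random(15) * 20})  # made-up sample data

# The NumPy result matches the pandas rolling mean once the NaN rows are dropped.
ma = moving_average(df["temp"].to_numpy(), n=3)
expected = df["temp"].rolling(3).mean().to_numpy()[2:]
assert np.allclose(ma, expected)

df["higher_3avg"] = prev_avg_flag(df["temp"].to_numpy(), n=3)
```

On very long arrays a cumulative sum can accumulate floating-point rounding error, so it may be worth checking the result against rolling().mean() on a sample before relying on it.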
Source: https://stackoverflow.com/questions/46872471/numpy-pandas-what-is-the-fastest-way-to-calculate-dataset-row-value-basing-on