Numpy, Pandas: what is the fastest way to calculate dataset row value basing on previous N values?

问题

I have a dataset and I want to enrich it. I need to calculate some new dataset column which is some function of previous N rows of another column.

As an example, given I want to calculate binary column which shows if current day temperature is higher than average in previous N days.

At the moment I just iterate through all the pandas dataset values using df.iterrows() and do appropriate calculations. This takes some time. Is there any better option?

回答1:

use rolling/moving window functions.

Sample DF:

In [46]: df = pd.DataFrame({'date':pd.date_range('2000-01-01', freq='D', periods=15), 'temp':np.random.rand(15)*20})

In [47]: df
Out[47]:
         date       temp
0  2000-01-01  17.246616
1  2000-01-02  18.228468
2  2000-01-03   6.245991
3  2000-01-04   8.890069
4  2000-01-05   6.837285
5  2000-01-06   1.555924
6  2000-01-07  18.641918
7  2000-01-08   6.308174
8  2000-01-09  13.601203
9  2000-01-10   6.482098
10 2000-01-11  15.711497
11 2000-01-12  18.690925
12 2000-01-13   2.493110
13 2000-01-14  17.626622
14 2000-01-15   6.982129

Solution:

In [48]: df['higher_3avg'] = df.rolling(3)['temp'].mean().diff().gt(0)

In [49]: df
Out[49]:
         date       temp  higher_3avg
0  2000-01-01  17.246616        False
1  2000-01-02  18.228468        False
2  2000-01-03   6.245991        False
3  2000-01-04   8.890069        False
4  2000-01-05   6.837285        False
5  2000-01-06   1.555924        False
6  2000-01-07  18.641918         True
7  2000-01-08   6.308174        False
8  2000-01-09  13.601203         True
9  2000-01-10   6.482098        False
10 2000-01-11  15.711497         True
11 2000-01-12  18.690925         True
12 2000-01-13   2.493110        False
13 2000-01-14  17.626622         True
14 2000-01-15   6.982129        False

Explanation:

In [50]: df.rolling(3)['temp'].mean()
Out[50]:
0           NaN
1           NaN
2     13.907025
3     11.121509
4      7.324448
5      5.761093
6      9.011709
7      8.835339
8     12.850431
9      8.797158
10    11.931599
11    13.628173
12    12.298511
13    12.936886
14     9.033954
Name: temp, dtype: float64

回答2:

for huge data, Numpy solutions are 30x faster. from Here :

def moving_average(a, n=3) :
    ret = a.cumsum()
    ret[n:]  -= ret[:-n]
    return ret[n - 1:] / n

In [419]: %timeit moving_average(df.values)
38.2 µs ± 1.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [420]: %timeit df.rolling(3).mean()
1.42 ms ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

来源：https://stackoverflow.com/questions/46872471/numpy-pandas-what-is-the-fastest-way-to-calculate-dataset-row-value-basing-on

标签

python

pandas

numpy

dataset