Access the result of a previous calculation in custom function passed to apply()


不要未来只要你来 2020-12-21 07:57

I'm working with Pandas in Python and I would like to access the result of the previous calculation when applying a custom function to a series.

Roughly like this:

1 answer

-上瘾入骨i (original poster)

2020-12-21 08:44
The most specialized cases of the operation you describe are available as cummax, cummin, cumprod and cumsum (for cumsum, f(x) = x + f(x-1)).

More functionality is available on expanding objects: mean, standard deviation, variance, kurtosis, skewness, correlation, etc.
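As a small illustration of these built-ins, the sketch below uses a made-up five-element Series (the name `s` and its values are assumptions, not from the question) to check that cumsum matches the recurrence f(x) = x + f(x-1) and that expanding().mean() gives the running mean:

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Cumulative sum: each element is the current value plus the previous result.
running_sum = s.cumsum()

# The same recurrence, f(x) = x + f(x-1), written out by hand.
manual = []
prev = 0.0
for value in s:
    prev = value + prev
    manual.append(prev)

# Expanding mean: the mean of all values seen so far.
running_mean = s.expanding().mean()

print(running_sum.tolist())
print(running_mean.tolist())
```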

And for the most general case, you can use expanding().apply() with a custom function. For example,
```
from functools import reduce  # For Python 3.x
ser.expanding().apply(lambda r: reduce(lambda prev, value: prev + 2*value, r))
```
is equivalent to f(x) = 2x + f(x-1), with the first element of the Series as the starting value.
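To see the equivalence concretely, here is a minimal check on a made-up four-element Series (the data and the names `via_expanding` and `expected` are assumptions for illustration). It passes `raw=True` so each window arrives as a NumPy array; the one-liner above does not set that option.

```python
from functools import reduce

import pandas as pd

# Hypothetical sample data, chosen only for illustration.
ser = pd.Series([1.0, 2.0, 3.0, 4.0])

# Each expanding window is reduced from scratch. With no initial value,
# reduce() starts from the first element, so the first result is ser[0].
via_expanding = ser.expanding().apply(
    lambda r: reduce(lambda prev, value: prev + 2 * value, r),
    raw=True,  # pass each window as a NumPy array
)

# The same recurrence, f(x) = 2x + f(x-1), as an explicit loop.
expected = [ser.iloc[0]]
for value in ser.iloc[1:]:
    expected.append(expected[-1] + 2 * value)

print(via_expanding.tolist())
```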

The methods I listed are optimized and run quite fast, but performance degrades when you use a custom function. For exponential smoothing, pandas starts to outperform loops for Series of length 1000, but the performance of expanding().apply() with reduce is quite bad:
```
import numpy as np
import pandas as pd

np.random.seed(0)
ser = pd.Series(70 + 5*np.random.randn(10**4))
ser.tail()
Out: 
9995    60.953592
9996    70.211794
9997    72.584361
9998    69.835397
9999    76.490557
dtype: float64


ser.ewm(alpha=0.1, adjust=False).mean().tail()
Out: 
9995    69.871614
9996    69.905632
9997    70.173505
9998    70.139694
9999    70.774781
dtype: float64

%timeit ser.ewm(alpha=0.1, adjust=False).mean()
1000 loops, best of 3: 779 µs per loop
```
With loops:
```
def exp_smoothing(ser, alpha=0.1):
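    prev = ser.iloc[0]  # seed the recursion with the first observation
    res = [prev]
    for cur in ser.iloc[1:]:  # positional access, independent of the index labels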
        prev = alpha*cur + (1-alpha)*prev
        res.append(prev)
    return pd.Series(res, index=ser.index)

exp_smoothing(ser).tail()
Out: 
9995    69.871614
9996    69.905632
9997    70.173505
9998    70.139694
9999    70.774781
dtype: float64

%timeit exp_smoothing(ser)
100 loops, best of 3: 3.54 ms per loop
```
The loop still finishes in milliseconds, whereas expanding().apply() takes seconds:
```
ser.expanding().apply(lambda r: reduce(lambda p, v: 0.9*p+0.1*v, r)).tail()
Out: 
9995    69.871614
9996    69.905632
9997    70.173505
9998    70.139694
9999    70.774781
dtype: float64

%timeit ser.expanding().apply(lambda r: reduce(lambda p, v: 0.9*p+0.1*v, r))
1 loop, best of 3: 13 s per loop
```
Methods like cummin and cumsum are optimized: they only need x's current value and the function's previous value. With a custom function, however, the complexity is O(n**2), because every expanding window is recomputed from scratch. This is mainly because, in the general case, the function's previous value and x's current value are not enough to compute the function's current value. For cumsum, you can add the current value to the previous cumsum to get the result. You cannot do that for, say, the geometric mean. That's why expanding().apply() becomes unusable for even moderately sized Series.
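When the recurrence really does depend only on the previous result and the current value, a single O(n) pass is enough. One way to write that pass, not used in the answer above and offered only as a sketch, is `itertools.accumulate` from the standard library. The names `alpha` and `smoothed` are assumptions for illustration; the data setup mirrors the benchmark Series.

```python
from itertools import accumulate

import numpy as np
import pandas as pd

np.random.seed(0)
ser = pd.Series(70 + 5 * np.random.randn(10**4))

alpha = 0.1  # smoothing factor, matching the ewm() call in the benchmark

# accumulate() feeds the previous result and the current value to the
# function once per element, so the whole pass is linear in len(ser).
smoothed = pd.Series(
    list(accumulate(ser.to_numpy(),
                    lambda prev, cur: alpha * cur + (1 - alpha) * prev)),
    index=ser.index,
)

# Reference result from the optimized pandas method.
reference = ser.ewm(alpha=alpha, adjust=False).mean()
print(np.allclose(smoothed, reference))
```

Like the reduce() version, accumulate() seeds the recursion with the first element, which matches ewm() with adjust=False.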

In general, iterating over a Series is not a very expensive operation. Iterating over a DataFrame is very inefficient because it has to return a copy of each row, but this is not the case for a Series. Of course you should use vectorized methods when they are available, but when none exists, a for loop is fine for a task like a recursive calculation.