Access the result of a previous calculation in custom function passed to apply()

I'm working with Pandas in Python and I would like to access the result of the previous calculation when applying a custom function to a series.

Roughly like this:

1 Answer
  •  -上瘾入骨i
    2020-12-21 08:44

    The most specialized forms of the operation you describe are available as built-in methods: cummax, cummin, cumprod and cumsum (for cumsum, f(x) = x + f(x-1)).
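As a minimal sketch of these cumulative methods, the Series `s` below is a made-up example, not data from the question. Each method carries its previous result forward and combines it with the current value:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
s = pd.Series([3, 1, 4, 1, 5])

running_sum = s.cumsum()    # previous sum + current value
running_max = s.cummax()    # max(previous max, current value)
running_min = s.cummin()    # min(previous min, current value)
running_prod = s.cumprod()  # previous product * current value

print(running_sum.tolist())   # [3, 4, 8, 9, 14]
print(running_max.tolist())   # [3, 3, 4, 4, 5]
```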

    More functionality can be found in expanding objects: mean, standard deviation, variance, kurtosis, skewness, correlation, etc.
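A short sketch of the expanding interface, again on made-up data: each output element is computed over all values from the start of the Series up to and including the current one.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
s = pd.Series([2.0, 4.0, 6.0, 8.0])

exp_mean = s.expanding().mean()  # mean of s[0..i] for every i
exp_std = s.expanding().std()    # sample standard deviation of s[0..i]

print(exp_mean.tolist())  # [2.0, 3.0, 4.0, 5.0]
print(exp_std)            # first entry is NaN: one point has no sample std
```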

    And for the most general case, you can use expanding().apply() with a custom function. For example,

    from functools import reduce  # For Python 3.x
    ser.expanding().apply(lambda r: reduce(lambda prev, value: prev + 2*value, r))
    

    is equivalent to f(x) = 2x + f(x-1), where the first element of the Series is used unchanged as the starting value.
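To check what that recurrence produces, here is a small sketch on made-up data. For this particular function the result also has a closed form, 2*cumsum(x) - x[0], which is an observation added here for illustration and not something every custom function will have. `raw=True` is passed so that the lambda receives plain NumPy arrays.

```python
from functools import reduce

import numpy as np
import pandas as pd

# Hypothetical sample data, used only for illustration.
s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Recurrence evaluated from scratch on every expanding window.
via_reduce = s.expanding().apply(
    lambda r: reduce(lambda prev, value: prev + 2 * value, r), raw=True
)

# The same values from a vectorized closed form (specific to this recurrence).
via_cumsum = 2 * s.cumsum() - s.iloc[0]

print(via_reduce.tolist())                  # [1.0, 5.0, 11.0, 19.0]
print(np.allclose(via_reduce, via_cumsum))  # True
```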

    The methods I listed are optimized and run quite fast, but performance degrades when you use a custom function. For exponential smoothing, pandas starts to outperform loops for Series of length 1000, whereas expanding().apply() with reduce performs quite badly:

    import numpy as np
    import pandas as pd

    np.random.seed(0)
    ser = pd.Series(70 + 5*np.random.randn(10**4))
    ser.tail()
    Out: 
    9995    60.953592
    9996    70.211794
    9997    72.584361
    9998    69.835397
    9999    76.490557
    dtype: float64
    
    
    ser.ewm(alpha=0.1, adjust=False).mean().tail()
    Out: 
    9995    69.871614
    9996    69.905632
    9997    70.173505
    9998    70.139694
    9999    70.774781
    dtype: float64
    
    %timeit ser.ewm(alpha=0.1, adjust=False).mean()
    1000 loops, best of 3: 779 µs per loop
    

    With loops:

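    def exp_smoothing(ser, alpha=0.1):
        # Seed the recursion with the first observation (positional access).
        prev = ser.iloc[0]
        res = [prev]
        # Each smoothed value needs only the previous result and the current value.
        for cur in ser.iloc[1:]:
            prev = alpha*cur + (1-alpha)*prev
            res.append(prev)
        return pd.Series(res, index=ser.index)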
    
    exp_smoothing(ser).tail()
    Out: 
    9995    69.871614
    9996    69.905632
    9997    70.173505
    9998    70.139694
    9999    70.774781
    dtype: float64
    
    %timeit exp_smoothing(ser)
    100 loops, best of 3: 3.54 ms per loop
    

    The total time is still in milliseconds. With expanding().apply(), however:

    ser.expanding().apply(lambda r: reduce(lambda p, v: 0.9*p+0.1*v, r)).tail()
    Out: 
    9995    69.871614
    9996    69.905632
    9997    70.173505
    9998    70.139694
    9999    70.774781
    dtype: float64
    
    %timeit ser.expanding().apply(lambda r: reduce(lambda p, v: 0.9*p+0.1*v, r))
    1 loop, best of 3: 13 s per loop
    

    Methods like cummin and cumsum are optimized: they only need the current value of x and the function's previous value. With a custom function, however, the complexity is O(n**2), because expanding().apply() passes the whole window to the function at every step. This is mainly because there are cases where the function's previous value and the current value of x are not enough to calculate the function's current value. For cumsum, you can take the previous cumsum and add the current value to get the result. You cannot do that for, say, the geometric mean. That is why expanding().apply() becomes unusable even for moderately sized Series.
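The quadratic cost can be made visible by counting how often the inner function runs. This is a rough sketch: `counter`, `step` and the length `n` are names and values chosen here for illustration. A window of length k costs k-1 calls, so n windows should cost n*(n-1)/2 calls in total.

```python
from functools import reduce

import pandas as pd

counter = {"calls": 0}  # hypothetical call counter, for illustration only

def step(prev, value):
    # One smoothing step; also records that it was called.
    counter["calls"] += 1
    return 0.9 * prev + 0.1 * value

n = 200
s = pd.Series(range(n), dtype="float64")

# Every expanding window restarts the reduction from the first element.
s.expanding().apply(lambda r: reduce(step, r), raw=True)

print(counter["calls"], n * (n - 1) // 2)
```

A plain loop over the same Series would call `step` only n-1 times.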

    In general, iterating over a Series is not a very expensive operation. For DataFrames, iteration needs to return a copy of each row, so it is very inefficient, but this is not the case for a Series. Of course you should use vectorized methods when they are available, but when they are not, a for loop is fine for a task like a recursive calculation.
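One way to package such a loop for arbitrary recurrences is `itertools.accumulate` from the standard library, which yields every intermediate result of a reduction in a single pass. The helper name `recursive_apply` below is hypothetical, and this is a sketch of one option and not a pandas feature:

```python
from itertools import accumulate

import numpy as np
import pandas as pd

def recursive_apply(ser, func):
    """Hypothetical helper: f(0) = x[0], f(i) = func(f(i-1), x[i])."""
    # accumulate visits each element once, so the cost is O(n).
    return pd.Series(list(accumulate(ser.tolist(), func)), index=ser.index)

np.random.seed(0)
ser = pd.Series(70 + 5 * np.random.randn(1000))

# Exponential smoothing with alpha = 0.1, written as a two-argument step.
smoothed = recursive_apply(ser, lambda prev, cur: 0.1 * cur + 0.9 * prev)

# Should agree with the built-in exponentially weighted mean.
print(np.allclose(smoothed, ser.ewm(alpha=0.1, adjust=False).mean()))
```

The step function has the same (previous result, current value) signature as the one passed to reduce, so an existing reduce-based lambda can be reused unchanged.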
