pandas rolling computation with window based on values instead of counts

暗喜 2020-12-04 17:17

I'm looking for a way to do something like the various rolling_* functions of pandas, but I want the window of the rolling computation to be defin

3 answers
  • 2020-12-04 17:30

    I think this does what you want:

    In [1]: df
    Out[1]:
       RollBasis  ToRoll
    0          1       1
    1          1       4
    2          1      -5
    3          2       2
    4          3      -4
    5          5      -2
    6          8       0
    7         10     -13
    8         12      -2
    9         13      -5
    
    In [2]: def f(x):
       ...:     ser = df.ToRoll[(df.RollBasis >= x) & (df.RollBasis < x+5)]
       ...:     return ser.sum()
    

    The function above takes a single RollBasis value x and selects the ToRoll entries whose RollBasis falls in the half-open interval [x, x+5). That selection is then summed and returned.

    In [3]: df['Rolled'] = df.RollBasis.apply(f)
    
    In [4]: df
    Out[4]:
       RollBasis  ToRoll  Rolled
    0          1       1      -4
    1          1       4      -4
    2          1      -5      -4
    3          2       2      -4
    4          3      -4      -6
    5          5      -2      -2
    6          8       0     -15
    7         10     -13     -20
    8         12      -2      -7
    9         13      -5      -5
    

    Code for the toy example DataFrame in case someone else wants to try:

    In [1]: from pandas import *
    
    In [2]: import io
    
    In [3]: text = """\
       ...:    RollBasis  ToRoll
       ...: 0          1       1
       ...: 1          1       4
       ...: 2          1      -5
       ...: 3          2       2
       ...: 4          3      -4
       ...: 5          5      -2
       ...: 6          8       0
       ...: 7         10     -13
       ...: 8         12      -2
       ...: 9         13      -5
       ...: """
    
    In [4]: df = read_csv(io.StringIO(text), header=0, index_col=0, sep=r'\s+')
    
  • 2020-12-04 17:43

    Based on BrenBarn's answer, but sped up by using label-based indexing rather than boolean indexing:

    import pandas as pd

    def rollBy(what, basis, window, func, *args, **kwargs):
        # Note: basis must be sorted for this to work properly.
        indexed_what = pd.Series(what.values, index=basis.values)
        def applyToWindow(val):
            # Using slice_indexer rather than indexed_what.loc[val:val+window]
            # allows window limits that are not themselves in the index.
            indexer = indexed_what.index.slice_indexer(val, val + window, 1)
            # indexer is a positional slice, so apply it with .iloc
            chunk = indexed_what.iloc[indexer]
            return func(chunk, *args, **kwargs)
        rolled = basis.apply(applyToWindow)
        return rolled
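
    One behavioural difference from the boolean version is worth checking before swapping one for the other: as far as I know, slice_indexer follows label-slicing rules and keeps the end label when it is present in the index, whereas the mask basis < val+window drops it. A small sketch on the toy RollBasis values (the variable names here are mine, not from the answers):

```python
import pandas as pd

basis = pd.Index([1, 1, 1, 2, 3, 5, 8, 10, 12, 13])
val, window = 8, 5

# Label-based slice: both ends inclusive, so the row with basis == 13 is kept.
closed = basis.slice_indexer(val, val + window)

# Half-open [val, val + window), matching the boolean mask.
start = basis.searchsorted(val, side="left")
stop = basis.searchsorted(val + window, side="left")
half_open = slice(int(start), int(stop))

print(closed, half_open)
```

    If the half-open behaviour matters, the searchsorted bounds can be passed to .iloc in place of the slice_indexer result.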
    

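    This is much faster than the boolean-indexing approach (in the timings below, rollBy_Ian is the label-based function above and rollBy_Bren is the boolean-mask version it is based on):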

    In [46]: df = pd.DataFrame({"RollBasis":np.random.uniform(0,1000000,100000), "ToRoll": np.random.uniform(0,10,100000)})
    
    In [47]: df = df.sort_values("RollBasis")
    
    In [48]: timeit("rollBy_Ian(df.ToRoll,df.RollBasis,10,sum)",setup="from __main__ import rollBy_Ian,df", number =3)
    Out[48]: 67.6615059375763
    
    In [49]: timeit("rollBy_Bren(df.ToRoll,df.RollBasis,10,sum)",setup="from __main__ import rollBy_Bren,df", number =3)
    Out[49]: 515.0221037864685
    

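    It's worth noting that the index-based solution scales roughly linearly in n (each window is located by a binary search on the sorted index, so strictly O(n log n) plus the cost of func on each chunk), while the boolean-mask version is O(n^2), since it scans the whole column for every row (I think).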
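
    When func is a plain sum, the per-window Python call can be avoided altogether. The following is a sketch of a different technique from the answers here (prefix sums plus np.searchsorted), under the assumption that a half-open window [x, x+window) is wanted; rolling_sum_by_value is a name made up for illustration:

```python
import numpy as np
import pandas as pd

def rolling_sum_by_value(what, basis, window):
    """Sum `what` over rows whose basis lies in [x, x + window), for each x in basis."""
    basis_vals = basis.to_numpy()
    order = np.argsort(basis_vals, kind="stable")   # sort once; input need not be sorted
    sorted_basis = basis_vals[order]
    sorted_what = what.to_numpy()[order]
    # prefix[i] holds the sum of the first i values in sorted order
    prefix = np.concatenate(([0], np.cumsum(sorted_what)))
    starts = np.searchsorted(sorted_basis, basis_vals, side="left")
    stops = np.searchsorted(sorted_basis, basis_vals + window, side="left")
    return pd.Series(prefix[stops] - prefix[starts], index=basis.index)

df = pd.DataFrame({
    "RollBasis": [1, 1, 1, 2, 3, 5, 8, 10, 12, 13],
    "ToRoll": [1, 4, -5, 2, -4, -2, 0, -13, -2, -5],
})
rolled = rolling_sum_by_value(df["ToRoll"], df["RollBasis"], 5)
print(rolled.tolist())
```

    This only covers sums (and things derived from them, such as counts or means). With floating-point data, subtracting prefix sums can lose a little precision compared with summing each window directly.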

    I find it more useful to evaluate evenly spaced windows running from the minimum of basis to its maximum, rather than one window per value of basis. This means altering the function as follows:

    import numpy as np
    import pandas as pd

    def rollBy(what, basis, window, func, *args, **kwargs):
        # Note: basis must be sorted for this to work properly.
        windows_min = basis.min()
        windows_max = basis.max()
        # One window start every `window` units, from the min to the max of basis
        window_starts = np.arange(windows_min, windows_max, window)
        window_starts = pd.Series(window_starts, index=window_starts)
        indexed_what = pd.Series(what.values, index=basis.values)
        def applyToWindow(val):
            # Using slice_indexer rather than indexed_what.loc[val:val+window]
            # allows window limits that are not themselves in the index.
            indexer = indexed_what.index.slice_indexer(val, val + window, 1)
            # indexer is a positional slice, so apply it with .iloc
            chunk = indexed_what.iloc[indexer]
            return func(chunk, *args, **kwargs)
        rolled = window_starts.apply(applyToWindow)
        return rolled
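
    For evenly spaced, non-overlapping windows like these, a groupby on the window start may be simpler. This is only a sketch of an alternative, not the method above: bin_by_value is a hypothetical name, windows are half-open [start, start+window), and windows containing no rows are simply absent from the result.

```python
import pandas as pd

def bin_by_value(what, basis, window, func="sum"):
    """Aggregate `what` over fixed-width bins of `basis`, starting at basis.min()."""
    origin = basis.min()
    # Start of the bin each row falls into: origin, origin + window, ...
    bin_start = origin + ((basis - origin) // window) * window
    return what.groupby(bin_start).agg(func)

df = pd.DataFrame({
    "RollBasis": [1, 1, 1, 2, 3, 5, 8, 10, 12, 13],
    "ToRoll": [1, 4, -5, 2, -4, -2, 0, -13, -2, -5],
})
binned = bin_by_value(df["ToRoll"], df["RollBasis"], 5)
print(binned)
```

    Unlike the slice-based version, this does not require basis to be sorted.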
    
  • 2020-12-04 17:48

    Based on Zelazny7's answer, I created this more general solution:

    def rollBy(what, basis, window, func):
        def applyToWindow(val):
            chunk = what[(val<=basis) & (basis<val+window)]
            return func(chunk)
        return basis.apply(applyToWindow)
    
    >>> rollBy(d.ToRoll, d.RollBasis, 5, sum)
    0    -4
    1    -4
    2    -4
    3    -4
    4    -6
    5    -2
    6   -15
    7   -20
    8    -7
    9    -5
    Name: RollBasis
    

    It's still not ideal as it is very slow compared to rolling_apply, but perhaps this is inevitable.
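
    In more recent pandas versions, one way to get closer to the speed of the built-in rolling functions may be a custom window indexer. The sketch below assumes pandas 1.0 or later (where pandas.api.indexers.BaseIndexer is available) and a basis sorted in ascending order; ForwardValueIndexer and its basis keyword are names invented for this example, not part of pandas.

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer

class ForwardValueIndexer(BaseIndexer):
    """Hypothetical indexer: window i covers rows with basis in
    [basis[i], basis[i] + window_size)."""

    def get_window_bounds(self, num_values=0, min_periods=None, center=None,
                          closed=None, step=None):
        basis = np.asarray(self.basis)  # assumed sorted ascending
        # Positional [start, end) bounds of each window, found by binary search
        start = np.searchsorted(basis, basis, side="left").astype(np.int64)
        end = np.searchsorted(basis, basis + self.window_size,
                              side="left").astype(np.int64)
        return start, end

df = pd.DataFrame({
    "RollBasis": [1, 1, 1, 2, 3, 5, 8, 10, 12, 13],
    "ToRoll": [1, 4, -5, 2, -4, -2, 0, -13, -2, -5],
})
indexer = ForwardValueIndexer(window_size=5, basis=df["RollBasis"].to_numpy())
# min_periods=1 so that windows holding only a few rows still produce a value
rolled = df["ToRoll"].rolling(indexer, min_periods=1).sum()
print(rolled.tolist())
```

    The result comes back as floats, as with the other rolling aggregations, and the same indexer can be reused with .mean(), .max() and so on.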
