Why does pandas rolling use a single-dimension ndarray?

清歌不尽 2020-11-29 02:12

I was motivated to use the pandas rolling feature to perform a rolling multi-factor regression (This question is NOT about rolling multi-factor reg

4 answers
  • 2020-11-29 02:40

    Since pandas v0.23, it is possible to pass a Series instead of an ndarray to Rolling.apply(). Just set raw=False. From the v0.23 documentation:

    raw : bool, default None

    False : passes each row or column as a Series to the function.

    True or None : the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance. The raw parameter is required and will show a FutureWarning if not passed. In the future raw will default to False.

    New in version 0.23.0.

    As noted, if you only need a single dimension, passing the data raw is more efficient. This is probably the answer to your question: Rolling.apply() was initially built to pass only an ndarray because that is the most efficient option. (In pandas 1.0 and later, raw defaults to False.)
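To make the difference concrete, here is a minimal sketch that records which type the applied function receives under each setting. The helper `record_type` and the sample Series are illustrative only and not part of the original answer.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
seen = []  # names of the types passed to the applied function

def record_type(window):
    # Remember what kind of object pandas handed us, then reduce it.
    seen.append(type(window).__name__)
    return window.sum()

# raw=False: each window arrives as a pd.Series (index labels preserved).
as_series = s.rolling(2).apply(record_type, raw=False)

# raw=True: each window arrives as a plain 1-D np.ndarray (faster, no index).
as_array = s.rolling(2).apply(record_type, raw=True)

print(sorted(set(seen)))
print(as_series.equals(as_array))
```

Either way the function only ever sees one column's window at a time, which is why the answers below build their own multi-column windows.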

  • 2020-11-29 02:44

    I wanted to share what I've done to work around this problem.

    Given a pd.DataFrame and a window, I generate a stacked ndarray using np.dstack (see answer). I then convert it to a pd.Panel and, using pd.Panel.to_frame, convert it to a pd.DataFrame. (Note that pd.Panel was deprecated in pandas 0.20 and removed in 0.25, so this requires an older pandas.) At this point, I have a pd.DataFrame with an additional index level relative to the original pd.DataFrame, and the new level identifies the position of each row within its rolled period. For example, if the roll window is 3, the new index level will contain [0, 1, 2], one item for each period. I can now groupby level=0 and return the groupby object. This gives me an object that I can manipulate much more intuitively.

    Roll Function

    import pandas as pd
    import numpy as np
    
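    def roll(df, w):
        # Stack every window of w consecutive rows; after .T the shape is
        # (n_windows, n_columns, w).
        roll_array = np.dstack([df.values[i:i+w, :] for i in range(len(df.index) - w + 1)]).T
        # Label each window by the index value of its last row.
        panel = pd.Panel(roll_array, 
                         items=df.index[w-1:],
                         major_axis=df.columns,
                         minor_axis=pd.Index(range(w), name='roll'))
        # One group per window, each holding w rows of the original columns.
        return panel.to_frame().unstack().T.groupby(level=0)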
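For pandas versions that no longer ship pd.Panel, a similar groupby object can probably be built with pd.concat instead. The following is only a sketch under that assumption: the name `roll_no_panel` and the index level name `end` are made up for illustration, and the sample frame reuses the values from the demonstration in this answer.

```python
import numpy as np
import pandas as pd

def roll_no_panel(df, w):
    # Hypothetical Panel-free variant: one sub-frame per window, keyed by the
    # index value of the window's last row; rows inside a window are 0..w-1.
    windows = {
        df.index[i + w - 1]: df.iloc[i:i + w].reset_index(drop=True)
        for i in range(len(df) - w + 1)
    }
    stacked = pd.concat(windows)
    stacked.index.names = ["end", "roll"]
    # Group by the outer level so that each group is one complete window.
    return stacked.groupby(level=0)

df = pd.DataFrame({"A": [0.44, 0.46, 0.46, 0.85, 0.78],
                   "B": [0.41, 0.47, 0.02, 0.82, 0.76]})
rolled = roll_no_panel(df, 2)
sums = rolled.sum()
print(sums)
```

The outer index level is called `end` here rather than `major`, so printed output will differ slightly in its labels from the Panel-based version.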
    

    Demonstration

    np.random.seed([3,1415])
    df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])
    
    print(df)
    
          A     B
    0  0.44  0.41
    1  0.46  0.47
    2  0.46  0.02
    3  0.85  0.82
    4  0.78  0.76
    

    Let's sum

    rolled_df = roll(df, 2)
    
    print(rolled_df.sum())
    
    major     A     B
    1      0.90  0.88
    2      0.92  0.49
    3      1.31  0.84
    4      1.63  1.58
    

    To peek under the hood, we can see the structure:

    print(rolled_df.apply(lambda x: x))
    
    major      A     B
      roll            
    1 0     0.44  0.41
      1     0.46  0.47
    2 0     0.46  0.47
      1     0.46  0.02
    3 0     0.46  0.02
      1     0.85  0.82
    4 0     0.85  0.82
      1     0.78  0.76
    

    But what about the purpose for which I built this, rolling multi-factor regression? I'll settle for matrix multiplication for now.

    X = np.array([2, 3])
    
    print(rolled_df.apply(lambda df: pd.Series(df.values.dot(X))))
    
          0     1
    1  2.11  2.33
    2  2.33  0.98
    3  0.98  4.16
    4  4.16  3.84
    
  • 2020-11-29 02:51

    I made the following modification to the above answer, since I needed to return an entry for every row of the original index, as pd.DataFrame.rolling() does:

    def roll(df, w):
        roll_array = np.dstack([df.values[i:i+w, :] for i in range(len(df.index) - w + 1)]).T
        # Pad the first w-1 (incomplete) windows; np.empty leaves them
        # uninitialized, so their values are arbitrary.
        roll_array_full_window = np.vstack((np.empty((w-1, len(df.columns), w)), roll_array))
        panel = pd.Panel(roll_array_full_window, 
                         items=df.index,
                         major_axis=df.columns,
                         minor_axis=pd.Index(range(w), name='roll'))
        return panel.to_frame().unstack().T.groupby(level=0)
    
  • 2020-11-29 02:53

    Using strided views on the DataFrame's underlying array, here's a vectorized approach:

    get_sliding_window(df, 2).dot(X) # window size = 2
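The definition of `get_sliding_window` does not appear in this copy of the answer. As a hedged reconstruction, not necessarily the author's code, one way to get the same kind of windowed view is NumPy's `sliding_window_view` (this assumes NumPy 1.20 or later). The sample frame reuses the values from the earlier demonstration.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def get_sliding_window(df, w):
    # Assumed implementation: build a read-only strided view, no data copied.
    a = df.to_numpy()
    # sliding_window_view appends the window axis last: (n_windows, n_cols, w).
    windows = sliding_window_view(a, window_shape=w, axis=0)
    # Reorder to (n_windows, w, n_cols) so each window looks like a small frame.
    return windows.transpose(0, 2, 1)

df = pd.DataFrame({"A": [0.44, 0.46, 0.46, 0.85, 0.78],
                   "B": [0.41, 0.47, 0.02, 0.82, 0.76]})
X = np.array([2, 3])

# Each row of the result holds the w dot products of one window with X.
result = get_sliding_window(df, 2).dot(X)
print(result)
```

Because the windows are a view rather than a copy, the cost of building them does not grow with the window size; only the final dot product touches the data.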
    

    Runtime test:

    In [101]: df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])
    
    In [102]: X = np.array([2, 3])
    
    In [103]: rolled_df = roll(df, 2)
    
    In [104]: %timeit rolled_df.apply(lambda df: pd.Series(df.values.dot(X)))
    100 loops, best of 3: 5.51 ms per loop
    
    In [105]: %timeit get_sliding_window(df, 2).dot(X)
    10000 loops, best of 3: 43.7 µs per loop
    

    Verify results:

    In [106]: rolled_df.apply(lambda df: pd.Series(df.values.dot(X)))
    Out[106]: 
          0     1
    1  2.70  4.09
    2  4.09  2.52
    3  2.52  1.78
    4  1.78  3.50
    
    In [107]: get_sliding_window(df, 2).dot(X)
    Out[107]: 
    array([[ 2.7 ,  4.09],
           [ 4.09,  2.52],
           [ 2.52,  1.78],
           [ 1.78,  3.5 ]])
    

    That's a huge improvement (roughly 125x on this small frame), which I hope stays noticeable on larger arrays!
