Python pandas rolling_apply two column input into function

故里飘歌 2020-12-08 05:25

This follows on from the question "Python custom function using rolling_apply for pandas", about using rolling_apply. Although I have progressed with my function, I

4 Answers
  • 2020-12-08 05:50

    Not sure if this is still relevant, but with the new rolling classes in pandas, whenever we pass raw=False to apply, each window is handed to the wrapped function as a Series. That means we have access to the index of each observation, and we can use that index to pull the matching rows from other columns.

    From the docs:

    raw : bool, default None

    False : passes each row or column as a Series to the function.

    True or None : the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance.

    In this scenario, we can do the following:

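    import numpy as np
    import pandas as pd

    ### create a func for multiple columns
    ### (df is the enclosing DataFrame; the '10s' window needs a DatetimeIndex)
    def cust_func(s):
        # s is the current window of col1; its index selects the same rows elsewhere
        val_for_col2 = df.loc[s.index, 'col2'] #.values
        val_for_col3 = df.loc[s.index, 'col3'] #.values
        val_for_col4 = df.loc[s.index, 'col4'] #.values

        ## apply over multiple column values
        return np.max(s) * np.min(val_for_col2) * np.max(val_for_col3) * np.mean(val_for_col4)


    ### Apply to the dataframe
    df.rolling('10s')['col1'].apply(cust_func, raw=False)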
    

    Note that we can still use all the functionality of the pandas rolling classes here, which is particularly useful when dealing with time-based windows.

    The fact that we are passing one column and using the entire dataframe feels like a hack, but it works in practice.
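
    As a self-contained illustration of the index trick above, here is a minimal sketch on made-up data. The column values, the 3-second window, and the helper name max1_times_min2 are assumptions chosen for the example, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row per second, two columns.
idx = pd.date_range("2020-01-01", periods=6, freq=pd.Timedelta(seconds=1))
df = pd.DataFrame(
    {"col1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
     "col2": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]},
    index=idx,
)

def max1_times_min2(window):
    # With raw=False, `window` is a Series carrying the original index,
    # so the same rows of col2 can be looked up in the enclosing DataFrame.
    other = df.loc[window.index, "col2"]
    return window.max() * other.min()

# Time-based window covering the current row and the two seconds before it.
result = df["col1"].rolling("3s").apply(max1_times_min2, raw=False)
print(result)
```

    Because the function closes over df, it only works while df is the same frame the rolling object was built from and its index is unique.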

  • 2020-12-08 05:52

    Here's another version of this question: Using rolling_apply on a DataFrame object. Use this if your function returns a Series.

    Since yours returns a scalar, do this.

    In [71]: df  = pd.DataFrame(np.random.randn(2000,2)/10000, 
                        index=pd.date_range('2001-01-01',periods=2000),
                        columns=['A','B'])
    

    Redefine your function to return a tuple containing the index label you want to use and the computed scalar value. Note that this is slightly different from the usual behaviour, as we return the first index of each window here rather than the last; you could do either.

    In [72]: def gm(df,p):
                  v =((((df['A']+df['B'])+1).cumprod())-1)*p
                  return (df.index[0],v.iloc[-1])
    
    
    In [73]: pd.Series(dict([ gm(df.iloc[i:min((i+1)+50,len(df)-1)],5) for i in range(len(df)-50) ]))
    
    Out[73]: 
    2001-01-01    0.000218
    2001-01-02   -0.001048
    2001-01-03   -0.002128
    2001-01-04   -0.003590
    2001-01-05   -0.004636
    2001-01-06   -0.005377
    2001-01-07   -0.004151
    2001-01-08   -0.005155
    2001-01-09   -0.004019
    2001-01-10   -0.004912
    2001-01-11   -0.005447
    2001-01-12   -0.005258
    2001-01-13   -0.004437
    2001-01-14   -0.004207
    2001-01-15   -0.004073
    ...
    2006-04-20   -0.006612
    2006-04-21   -0.006299
    2006-04-22   -0.006320
    2006-04-23   -0.005690
    2006-04-24   -0.004316
    2006-04-25   -0.003821
    2006-04-26   -0.005102
    2006-04-27   -0.004760
    2006-04-28   -0.003832
    2006-04-29   -0.004123
    2006-04-30   -0.004241
    2006-05-01   -0.004684
    2006-05-02   -0.002993
    2006-05-03   -0.003938
    2006-05-04   -0.003528
    Length: 1950
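
    The tuple approach can also label each result with the last index of its window, which matches what rolling functions normally do. The following is a rough sketch on a tiny, made-up frame; the data values and the window size of 2 are assumptions chosen so the result is easy to check by hand.

```python
import pandas as pd

# Hypothetical small frame so the arithmetic is easy to follow.
df = pd.DataFrame(
    {"A": [0.01, 0.02, -0.01, 0.03], "B": [0.0, 0.01, 0.02, -0.02]},
    index=pd.date_range("2001-01-01", periods=4),
)

def gm(window, p):
    # Cumulative growth over the window, scaled by p.
    growth = ((window["A"] + window["B"] + 1).cumprod() - 1) * p
    # Label the result with the LAST index of the window.
    return window.index[-1], growth.iloc[-1]

size = 2  # window length in rows (an assumption for this sketch)
result = pd.Series(
    dict(gm(df.iloc[i:i + size], 5) for i in range(len(df) - size + 1))
)
print(result)
```

    Slicing with iloc passes the whole sub-frame to the function, so any number of columns is available inside it, at the cost of a Python-level loop over windows.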
    
  • 2020-12-08 05:55

    All rolling_* functions work on 1-D arrays. I'm sure one could invent workarounds for passing 2-D arrays, but in your case you can simply precompute the row-wise values and run the rolling evaluation over those:

    >>> def gm(x,p):
    ...     return ((np.cumprod(x) - 1)*p)[-1]
    ...
    >>> pd.rolling_apply(tmp['A']+tmp['B']+1, 50, lambda x: gm(x,5))
    2001-01-01   NaN
    2001-01-02   NaN
    2001-01-03   NaN
    2001-01-04   NaN
    2001-01-05   NaN
    2001-01-06   NaN
    2001-01-07   NaN
    2001-01-08   NaN
    2001-01-09   NaN
    2001-01-10   NaN
    2001-01-11   NaN
    2001-01-12   NaN
    2001-01-13   NaN
    2001-01-14   NaN
    2001-01-15   NaN
    ...
    2006-06-09   -0.000062
    2006-06-10   -0.000128
    2006-06-11    0.000185
    2006-06-12   -0.000113
    2006-06-13   -0.000962
    2006-06-14   -0.001248
    2006-06-15   -0.001962
    2006-06-16   -0.003820
    2006-06-17   -0.003412
    2006-06-18   -0.002971
    2006-06-19   -0.003882
    2006-06-20   -0.003546
    2006-06-21   -0.002226
    2006-06-22   -0.002058
    2006-06-23   -0.000553
    Freq: D, Length: 2000
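
    The top-level pd.rolling_apply function is not available in later pandas releases. As a sketch of the same precompute-then-roll idea with the Series.rolling interface (the seeded random data and the shorter 200-row frame are assumptions made for the example):

```python
import numpy as np
import pandas as pd

# Hypothetical data shaped like the question's frame, seeded for repeatability.
rng = np.random.default_rng(0)
tmp = pd.DataFrame(
    rng.standard_normal((200, 2)) / 10000,
    index=pd.date_range("2001-01-01", periods=200),
    columns=["A", "B"],
)

def gm(x, p):
    # x is a 1-D ndarray of precomputed (A + B + 1) values for one window.
    return ((np.cumprod(x) - 1) * p)[-1]

# Combine the columns first, then roll over the resulting single Series.
combined = tmp["A"] + tmp["B"] + 1
res = combined.rolling(50).apply(lambda x: gm(x, 5), raw=True)
print(res.tail())
```

    With a fixed window of 50 rows, the first 49 results are NaN because those windows are incomplete.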
    
  • 2020-12-08 06:07

    It looks like rolling_apply converts the input to the user function into an ndarray (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.stats.moments.rolling_apply.html?highlight=rolling_apply#pandas.stats.moments.rolling_apply).

    A workaround is to roll over an auxiliary column ii of row positions, and use those positions inside the function gm to select the window from the full DataFrame:

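    import pandas as pd
    import numpy as np

    tmp = pd.DataFrame(np.random.randn(2000,2)/10000, columns=['A','B'])
    tmp['date'] = pd.date_range('2001-01-01',periods=2000)
    tmp['ii'] = range(len(tmp))  # auxiliary column of row positions

    def gm(ii, df, p):
        # ii arrives as a float ndarray of row positions; cast back to int for iloc
        x_df = df.iloc[ii.astype(int)]
        v =((((x_df['A']+x_df['B'])+1).cumprod())-1)*p
        return v.iloc[-1]

    # Originally written as pd.rolling_apply(tmp.ii, 50, ...), which later pandas
    # releases removed; Series.rolling(...).apply(..., raw=True) is the equivalent call.
    res = tmp['ii'].rolling(50).apply(lambda x: gm(x, tmp, 5), raw=True)
    print(res)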
    