How to speed up Pandas multilevel dataframe shift by group?

离开以前 2021-01-22 03:36

I am trying to shift the data in a Pandas DataFrame column within each group of the first index level. Here is the demo code:

In [8]: df = mul_df(5,4,3)

In [9]: df
Out[9]:
                    


        
4 answers
  •  半阙折子戏
    2021-01-22 04:02

    There is a similar question, to which I added an answer that works for a shift in either direction and of any magnitude: pandas: setting last N rows of multi-index to Nan for speeding up groupby with shift

    Code (including test setup) is:

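    import numpy as np
    import pandas as pd

    #
    # the function to use in apply
    #
    def replace_shift_overlap(grp, col, N, value):
        # Overwrite the rows that the whole-column shift pulled in from a
        # neighbouring group: the first N rows if N > 0, the last |N| rows if N < 0.
        grp = grp.copy()
        col_pos = grp.columns.get_loc(col)
        if N > 0:
            grp.iloc[:N, col_pos] = value
        elif N < 0:
            grp.iloc[N:, col_pos] = value
        return grp


    length = 5
    groups = 3
    rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
    frames = []
    for x in range(groups):
        tmpdf = pd.DataFrame({'date': rng1,
                              'category': int(10000000 * abs(np.random.randn())),
                              'colA': np.random.randn(length),
                              'colB': np.random.randn(length)})
        frames.append(tmpdf)
    df = pd.concat(frames)

    # DataFrame.sort no longer exists in current pandas; sort_values replaces it
    df.sort_values(['category', 'date'], inplace=True)
    df.set_index(['category', 'date'], inplace=True, drop=True)
    shiftBy = -1
    # shift the whole column at once, ignoring group boundaries
    df['tmpShift'] = df['colB'].shift(shiftBy)

    #
    # the apply
    #
    # group_keys=False keeps the original (category, date) index
    df = df.groupby(level=0, group_keys=False).apply(replace_shift_overlap, 'tmpShift', shiftBy, np.nan)

    # Yay this is so much faster.
    df['newColumn'] = df['tmpShift'] / df['colA']
    df.drop(columns='tmpShift', inplace=True)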
    

    EDIT: Note that the initial sort eats into the speedup, so in some cases the original answer is faster.
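
The same shift-then-mask idea can also be written without `apply`. The sketch below is not part of the original answer. It assumes the frame is already sorted so that each group's rows are contiguous, and the helper name `shift_within_groups` is hypothetical. It finds the overlap rows with `GroupBy.cumcount` and checks the result against a plain `groupby(level=0).shift(n)` on a small made-up frame.

```python
import numpy as np
import pandas as pd


def shift_within_groups(df, col, n, level=0):
    """Shift df[col] by n rows within each group of index level `level`.

    Hypothetical helper: assumes the rows of every group are contiguous
    (for example because the frame was sorted by that level).
    """
    # Shift the whole column once, ignoring group boundaries.
    shifted = df[col].shift(n)
    grouped = df.groupby(level=level)
    if n > 0:
        # Row position counted from the start of each group: 0, 1, 2, ...
        overlap = grouped.cumcount() < n
    else:
        # Row position counted from the end of each group: ..., 2, 1, 0
        overlap = grouped.cumcount(ascending=False) < -n
    # Blank out the values that leaked in from a neighbouring group.
    return shifted.mask(overlap.to_numpy())


# Small made-up frame: 3 categories with 5 consecutive days each.
rng = np.random.default_rng(0)
dates = pd.date_range("1990-01-01", periods=5, freq="D")
index = pd.MultiIndex.from_product([[10, 20, 30], dates],
                                   names=["category", "date"])
demo = pd.DataFrame({"colA": rng.standard_normal(15),
                     "colB": rng.standard_normal(15)}, index=index)

# The result should match the straightforward per-group shift.
for n in (-2, -1, 1, 2):
    fast = shift_within_groups(demo, "colB", n)
    reference = demo.groupby(level=0)["colB"].shift(n)
    pd.testing.assert_series_equal(fast, reference)
```

Whether this is actually faster than `groupby(level=0).shift(n)` depends on the pandas version and the data, so it is worth timing both on the real frame before switching.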
