Pandas iterating over multiple rows at once with overlap

旧时难觅i 2020-12-17 05:48

I have a pandas DataFrame that needs to be fed in chunks of n rows into downstream functions (print in the example). The chunks may have overlapping rows.

2 answers
  • 2020-12-17 06:14

    Use DataFrame.groupby with integer division of a helper 1-D array that has the same length as df. The chunks produced this way do not overlap:

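    import numpy as np
    import pandas as pd

    d = {'A': list(range(5)), 'B': list(range(5))}
    df = pd.DataFrame(d)

    # Group label per row position: positions 0-1 -> 0, 2-3 -> 1, 4 -> 2
    print(np.arange(len(df)) // 2)
    [0 0 1 1 2]

    # Each group is one chunk of up to 2 rows
    for i, g in df.groupby(np.arange(len(df)) // 2):
        print(g)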
    
       A  B
    0  0  0
    1  1  1
       A  B
    2  2  2
    3  3  3
       A  B
    4  4  4
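
The same grouping trick works for any chunk size. Below is a minimal sketch, assuming a hypothetical helper named `iter_chunks` (not part of the original answer) that wraps the groupby call in a generator:

```python
import numpy as np
import pandas as pd

def iter_chunks(df, n):
    """Yield consecutive, non-overlapping chunks of at most n rows.

    Hypothetical helper for illustration; the name is an assumption.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Floor-dividing the row positions by n gives one group label per chunk.
    labels = np.arange(len(df)) // n
    for _, chunk in df.groupby(labels):
        yield chunk

df = pd.DataFrame({'A': list(range(5)), 'B': list(range(5))})
chunks = list(iter_chunks(df, 2))
```

Because the labels come from row positions rather than index values, this should behave the same way for a non-integer or unsorted index.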
    

    EDIT:

    For overlapping chunks, a chunker function adapted from another answer can be used:

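    def chunker1(seq, size):
        # Slide a window of `size` rows forward one row at a time,
        # stopping at the last position where a full window still fits
        return (seq.iloc[pos:pos + size] for pos in range(0, len(seq) - size + 1))

    for i in chunker1(df, 2):
        print(i)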
    
       A  B
    0  0  0
    1  1  1
       A  B
    1  1  1
    2  2  2
       A  B
    2  2  2
    3  3  3
       A  B
    3  3  3
    4  4  4
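
The window above always advances by one row. A possible generalization, sketched here with a hypothetical `sliding_windows` function and a `step` parameter (both are assumptions, not part of the original answer), lets the caller choose how far each window moves:

```python
import pandas as pd

def sliding_windows(seq, size, step=1):
    """Yield full windows of `size` rows, advancing `step` rows each time.

    Hypothetical helper for illustration; works on a DataFrame or a Series.
    """
    if size < 1 or step < 1:
        raise ValueError("size and step must be positive integers")
    # The last valid start position is len(seq) - size, so no short
    # trailing window is produced.
    for pos in range(0, len(seq) - size + 1, step):
        yield seq.iloc[pos:pos + size]

df = pd.DataFrame({'A': list(range(5)), 'B': list(range(5))})
windows = list(sliding_windows(df, 3))
```

With `step=1` and `size=2` this produces the same four chunks as the output shown above; consecutive windows share `size - step` rows.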
    
  • 2020-12-17 06:30

    Overlapping-chunk generator function for iterating over pandas DataFrames and Series

    A chunk function with an overlap parameter that controls how many rows chunks share

    A generator version of the chunk function, with an overlap parameter that controls how many rows consecutive chunks share, is presented below. This version also works with a custom index on the pd.DataFrame or pd.Series (e.g. a float index), because it slices by position with .iloc. To make the overlap easy to check, an integer index is used here.

    import numpy as np
    import pandas as pd

    sz = 14
    # ind = np.linspace(0., 10., num=sz)  # a float index works as well
    ind = range(sz)

    df = pd.DataFrame(np.random.rand(sz, 4),
                      index=ind,
                      columns=['a', 'b', 'c', 'd'])

    def chunker(seq, size, overlap):
        # Consecutive chunks start size - overlap rows apart, so they
        # share `overlap` rows; this requires overlap < size
        for pos in range(0, len(seq), size - overlap):
            yield seq.iloc[pos:pos + size]

    chunk_size = 6
    chunk_overlap = 2
    for i in chunker(df, chunk_size, chunk_overlap):
        print(i)

    # The generator can also be advanced manually with next()
    chnk = chunker(df, chunk_size, chunk_overlap)
    print('\n', chnk, end='\n\n')
    print('First "next()":', next(chnk), sep='\n', end='\n\n')
    print('Second "next()":', next(chnk), sep='\n', end='\n\n')
    print('Third "next()":', next(chnk), sep='\n', end='\n\n')
    

    The output for an overlap of 2 rows (chunk_overlap = 2):

              a         b         c         d
    0  0.577076  0.025997  0.692832  0.884328
    1  0.504888  0.575851  0.514702  0.056509
    2  0.880886  0.563262  0.292375  0.881445
    3  0.360011  0.978203  0.799485  0.409740
    4  0.774816  0.332331  0.809632  0.675279
    5  0.453223  0.621464  0.066353  0.083502
              a         b         c         d
    4  0.774816  0.332331  0.809632  0.675279
    5  0.453223  0.621464  0.066353  0.083502
    6  0.985677  0.110076  0.724568  0.990237
    7  0.109516  0.777629  0.485162  0.275508
    8  0.765256  0.226010  0.262838  0.758222
    9  0.805593  0.760361  0.833966  0.024916
               a         b         c         d
    8   0.765256  0.226010  0.262838  0.758222
    9   0.805593  0.760361  0.833966  0.024916
    10  0.418790  0.305439  0.258288  0.988622
    11  0.978391  0.013574  0.427689  0.410877
    12  0.943751  0.331948  0.823607  0.847441
    13  0.359432  0.276289  0.980688  0.996048
               a         b         c         d
    12  0.943751  0.331948  0.823607  0.847441
    13  0.359432  0.276289  0.980688  0.996048
    
     
    
    First "next()":
              a         b         c         d
    0  0.577076  0.025997  0.692832  0.884328
    1  0.504888  0.575851  0.514702  0.056509
    2  0.880886  0.563262  0.292375  0.881445
    3  0.360011  0.978203  0.799485  0.409740
    4  0.774816  0.332331  0.809632  0.675279
    5  0.453223  0.621464  0.066353  0.083502
    
    Second "next()":
              a         b         c         d
    4  0.774816  0.332331  0.809632  0.675279
    5  0.453223  0.621464  0.066353  0.083502
    6  0.985677  0.110076  0.724568  0.990237
    7  0.109516  0.777629  0.485162  0.275508
    8  0.765256  0.226010  0.262838  0.758222
    9  0.805593  0.760361  0.833966  0.024916
    
    Third "next()":
               a         b         c         d
    8   0.765256  0.226010  0.262838  0.758222
    9   0.805593  0.760361  0.833966  0.024916
    10  0.418790  0.305439  0.258288  0.988622
    11  0.978391  0.013574  0.427689  0.410877
    12  0.943751  0.331948  0.823607  0.847441
    13  0.359432  0.276289  0.980688  0.996048
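
In the loop output above, the final chunk (rows 12 and 13) contains only rows that the previous chunk already covered, and an overlap equal to or larger than the chunk size would give `range` a zero or negative step. The sketch below is one way to guard against both; `chunker_checked` is a hypothetical name and this variant is an assumption, not the answer's own code. It uses a float index to illustrate that the slicing is positional:

```python
import numpy as np
import pandas as pd

def chunker_checked(seq, size, overlap):
    """Yield chunks of `size` rows that share `overlap` rows with the next chunk.

    Hypothetical variant for illustration: validates its arguments and stops
    once a chunk has reached the end of the data.
    """
    if size < 1:
        raise ValueError("size must be a positive integer")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    for pos in range(0, len(seq), step):
        yield seq.iloc[pos:pos + size]
        if pos + size >= len(seq):
            # This chunk already includes the last row; a further chunk
            # would only repeat rows from it.
            break

# Deterministic data with a float index (0.0, 0.5, ..., 6.5)
df = pd.DataFrame(np.arange(14 * 4).reshape(14, 4),
                  index=np.linspace(0., 6.5, num=14),
                  columns=['a', 'b', 'c', 'd'])
chunks = list(chunker_checked(df, 6, 2))
```

Whether the short trailing chunk should be dropped depends on the downstream function; if every start position must be visited, the original `chunker` behaviour is the one to keep.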
    