I have a pandas DataFrame that needs to be fed in chunks of n rows into downstream functions (print
in the example). The chunks may have overlapping rows.
Use DataFrame.groupby with integer division on a helper 1-D array that has the same length as df.
With this approach the chunks do not overlap:
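import numpy as np
import pandas as pd

d = {'A': list(range(5)), 'B': list(range(5))}
df = pd.DataFrame(d)
# helper array: row position // chunk size gives one chunk label per row
print(np.arange(len(df)) // 2)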
[0 0 1 1 2]
for i, g in df.groupby(np.arange(len(df)) // 2):
    print(g)
   A  B
0  0  0
1  1  1
   A  B
2  2  2
3  3  3
   A  B
4  4  4
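The same grouping idea can be wrapped in a small helper for any chunk size n. This is only a minimal sketch: the name `iter_row_chunks` is hypothetical and not part of pandas. Because the helper array is built from row positions, the result should not depend on what the index of the DataFrame looks like.

```python
import numpy as np
import pandas as pd


def iter_row_chunks(frame, n):
    """Yield consecutive, non-overlapping chunks of at most n rows."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Row position // n gives the chunk number of every row: 0, 0, ..., 1, 1, ...
    chunk_ids = np.arange(len(frame)) // n
    for _, chunk in frame.groupby(chunk_ids):
        yield chunk


df = pd.DataFrame({'A': list(range(5)), 'B': list(range(5))})
chunks = list(iter_row_chunks(df, 2))
print([len(c) for c in chunks])  # [2, 2, 1]
```

The last chunk is shorter whenever the number of rows is not a multiple of n, as with the 5-row frame here.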
EDIT:
For overlapping chunks, an adapted version of this answer can be used:
def chunker1(seq, size):
    # one window per start position; stop early enough that every window has `size` rows
    return (seq.iloc[pos:pos + size] for pos in range(len(seq) - size + 1))
for i in chunker1(df, 2):
    print(i)
   A  B
0  0  0
1  1  1
   A  B
1  1  1
2  2  2
   A  B
2  2  2
3  3  3
   A  B
3  3  3
4  4  4
A generator version of the chunk function, with an overlap parameter that controls how many rows consecutive chunks share, is presented below. This version also works with a custom index on the pd.DataFrame or pd.Series (e.g. a float index), because it slices by position with iloc. To make the overlap easier to check, an integer index is used here.
import numpy as np
import pandas as pd

sz = 14
# ind = np.linspace(0., 10., num=sz)
ind = range(sz)
df = pd.DataFrame(np.random.rand(sz, 4),
                  index=ind,
                  columns=['a', 'b', 'c', 'd'])
def chunker(seq, size, overlap):
    # chunk starts are size - overlap rows apart, so overlap must be smaller than size
    for pos in range(0, len(seq), size - overlap):
        yield seq.iloc[pos:pos + size]
chunk_size = 6
chunk_overlap = 2
for i in chunker(df, chunk_size, chunk_overlap):
    print(i)
chnk = chunker(df, chunk_size, chunk_overlap)
print('\n', chnk, end='\n\n')
print('First "next()":', next(chnk), sep='\n', end='\n\n')
print('Second "next()":', next(chnk), sep='\n', end='\n\n')
print('Third "next()":', next(chnk), sep='\n', end='\n\n')
The output for an overlap of 2 rows:
          a         b         c         d
0  0.577076  0.025997  0.692832  0.884328
1  0.504888  0.575851  0.514702  0.056509
2  0.880886  0.563262  0.292375  0.881445
3  0.360011  0.978203  0.799485  0.409740
4  0.774816  0.332331  0.809632  0.675279
5  0.453223  0.621464  0.066353  0.083502
          a         b         c         d
4  0.774816  0.332331  0.809632  0.675279
5  0.453223  0.621464  0.066353  0.083502
6  0.985677  0.110076  0.724568  0.990237
7  0.109516  0.777629  0.485162  0.275508
8  0.765256  0.226010  0.262838  0.758222
9  0.805593  0.760361  0.833966  0.024916
           a         b         c         d
8   0.765256  0.226010  0.262838  0.758222
9   0.805593  0.760361  0.833966  0.024916
10  0.418790  0.305439  0.258288  0.988622
11  0.978391  0.013574  0.427689  0.410877
12  0.943751  0.331948  0.823607  0.847441
13  0.359432  0.276289  0.980688  0.996048
           a         b         c         d
12  0.943751  0.331948  0.823607  0.847441
13  0.359432  0.276289  0.980688  0.996048

First "next()":
          a         b         c         d
0  0.577076  0.025997  0.692832  0.884328
1  0.504888  0.575851  0.514702  0.056509
2  0.880886  0.563262  0.292375  0.881445
3  0.360011  0.978203  0.799485  0.409740
4  0.774816  0.332331  0.809632  0.675279
5  0.453223  0.621464  0.066353  0.083502

Second "next()":
          a         b         c         d
4  0.774816  0.332331  0.809632  0.675279
5  0.453223  0.621464  0.066353  0.083502
6  0.985677  0.110076  0.724568  0.990237
7  0.109516  0.777629  0.485162  0.275508
8  0.765256  0.226010  0.262838  0.758222
9  0.805593  0.760361  0.833966  0.024916

Third "next()":
           a         b         c         d
8   0.765256  0.226010  0.262838  0.758222
9   0.805593  0.760361  0.833966  0.024916
10  0.418790  0.305439  0.258288  0.988622
11  0.978391  0.013574  0.427689  0.410877
12  0.943751  0.331948  0.823607  0.847441
13  0.359432  0.276289  0.980688  0.996048
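Note that the fourth chunk in the output above (rows 12 and 13) contains only rows already seen in the third chunk. If that is not wanted, a stricter variant could look like the sketch below. The name `chunker_strict` is hypothetical, and the rule that `overlap` must be smaller than `size` is an assumption added here (otherwise `range` would receive a zero or negative step). A float index is used to illustrate that `iloc` slices by position regardless of the index type.

```python
import numpy as np
import pandas as pd


def chunker_strict(seq, size, overlap):
    """Yield overlapping chunks; assumes 0 <= overlap < size."""
    if size < 1 or not 0 <= overlap < size:
        raise ValueError("need size >= 1 and 0 <= overlap < size")
    step = size - overlap
    for pos in range(0, len(seq), step):
        yield seq.iloc[pos:pos + size]
        # Stop once a chunk has reached the last row, so that no chunk
        # consists solely of rows repeated from the previous one.
        if pos + size >= len(seq):
            break


sz = 14
df = pd.DataFrame(np.random.rand(sz, 4),
                  index=np.linspace(0., 10., num=sz),  # float index
                  columns=['a', 'b', 'c', 'd'])
sizes = [len(c) for c in chunker_strict(df, 6, 2)]
print(sizes)  # [6, 6, 6]
```

With 14 rows, a chunk size of 6 and an overlap of 2, this yields three chunks instead of four. Since the function is a generator, the `ValueError` for invalid arguments is raised when the first chunk is requested, not when the function is called.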