How to iterate over consecutive chunks of Pandas dataframe efficiently

悲&欢浪女 2020-11-28 03:52

I have a large dataframe (several million rows).

I want to be able to do a groupby operation on it, but just grouping by arbitrary consecutive (preferably equal-size

6 answers
  •  感情败类
    2020-11-28 04:18

    In practice, you can't guarantee equal-sized chunks. The number of rows (N) might be prime, in which case the only equal chunk sizes are 1 and N. Because of this, real-world chunking typically uses a fixed size and allows a smaller chunk at the end. I tend to pass an array to groupby. Starting from:

    >>> import numpy as np
    >>> import pandas as pd
    >>> df = pd.DataFrame(np.random.rand(15, 5), index=[0]*15)
    >>> df[0] = range(15)
    >>> df
        0         1         2         3         4
    0   0  0.746300  0.346277  0.220362  0.172680
    0   1  0.657324  0.687169  0.384196  0.214118
    0   2  0.016062  0.858784  0.236364  0.963389
    [...]
    0  13  0.510273  0.051608  0.230402  0.756921
    0  14  0.950544  0.576539  0.642602  0.907850
    
    [15 rows x 5 columns]
    

    where I've deliberately made the index uninformative by setting every label to 0, we simply decide on our chunk size (here 10) and integer-divide an array of row positions by it:

    >>> df.groupby(np.arange(len(df))//10)
    
    >>> for k,g in df.groupby(np.arange(len(df))//10):
    ...     print(k,g)
    ...     
    0    0         1         2         3         4
    0  0  0.746300  0.346277  0.220362  0.172680
    0  1  0.657324  0.687169  0.384196  0.214118
    0  2  0.016062  0.858784  0.236364  0.963389
    [...]
    0  8  0.241049  0.246149  0.241935  0.563428
    0  9  0.493819  0.918858  0.193236  0.266257
    
    [10 rows x 5 columns]
    1     0         1         2         3         4
    0  10  0.037693  0.370789  0.369117  0.401041
    0  11  0.721843  0.862295  0.671733  0.605006
    [...]
    0  14  0.950544  0.576539  0.642602  0.907850
    
    [5 rows x 5 columns]
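
The same positional key can feed an aggregation directly instead of a loop. The following is a minimal sketch, not part of the original answer: the column name `value` and the variable names `chunk_size` and `chunk_ids` are made up for illustration, and the data is deterministic so the results are easy to check.

```python
import numpy as np
import pandas as pd

# Illustrative frame: 15 rows, with an uninformative index as in the example above.
df = pd.DataFrame({"value": range(15)}, index=[0] * 15)

chunk_size = 10  # hypothetical chunk size; the last chunk may be smaller

# Row position // chunk_size gives 0 for rows 0-9 and 1 for rows 10-14.
chunk_ids = np.arange(len(df)) // chunk_size

# Number of rows in each chunk, and a per-chunk sum of the column.
sizes = df.groupby(chunk_ids).size()
sums = df.groupby(chunk_ids)["value"].sum()

print(sizes.tolist())
print(sums.tolist())
```

Because the key is built from row positions rather than index labels, this works even when the index is duplicated or unsorted.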
    

    Methods based on slicing the DataFrame can fail when the index isn't compatible with that, although you can always use .iloc[a:b] to ignore the index values and access data by position.
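
For the slicing route, a positional generator might look like the sketch below. The helper name `iter_chunks` and the column name `value` are hypothetical and not from the original answer; the only pandas feature relied on is `.iloc[a:b]`, which selects rows by position regardless of the index labels.

```python
import pandas as pd


def iter_chunks(frame, chunk_size):
    """Yield consecutive positional slices of `frame` with up to `chunk_size` rows each."""
    for start in range(0, len(frame), chunk_size):
        # .iloc slices by position, so duplicate or unordered index labels do not matter.
        yield frame.iloc[start:start + chunk_size]


# Illustrative frame with an uninformative index.
df = pd.DataFrame({"value": range(15)}, index=[0] * 15)

chunks = list(iter_chunks(df, 10))
lengths = [len(chunk) for chunk in chunks]
print(lengths)
```

As with the groupby approach, the final chunk is simply shorter when the row count is not a multiple of the chunk size.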
