How to iterate over consecutive chunks of Pandas dataframe efficiently

悲&欢浪女 2020-11-28 03:52

I have a large dataframe (several million rows).

I want to be able to do a groupby operation on it, but just grouping by arbitrary consecutive (preferably equal-size

6 Answers
  •  猫巷女王i
    2020-11-28 04:10

    I'm not sure if this is exactly what you want, but I found these grouper functions on another SO thread fairly useful when working with a multiprocessing pool.

    Here's a short example from that thread, which might do something like what you want:

    import numpy as np
    import pandas as pds
    
    df = pds.DataFrame(np.random.rand(14, 4), columns=['a', 'b', 'c', 'd'])
    
    def chunker(seq, size):
        # Lazily yield consecutive slices of at most `size` items;
        # the last slice may be shorter.
        return (seq[pos:pos + size] for pos in range(0, len(seq), size))
    
    for i in chunker(df, 5):
        print(i)
    

    Which gives you something like this:

              a         b         c         d
    0  0.860574  0.059326  0.339192  0.786399
    1  0.029196  0.395613  0.524240  0.380265
    2  0.235759  0.164282  0.350042  0.877004
    3  0.545394  0.881960  0.994079  0.721279
    4  0.584504  0.648308  0.655147  0.511390
              a         b         c         d
    5  0.276160  0.982803  0.451825  0.845363
    6  0.728453  0.246870  0.515770  0.343479
    7  0.971947  0.278430  0.006910  0.888512
    8  0.044888  0.875791  0.842361  0.890675
    9  0.200563  0.246080  0.333202  0.574488
               a         b         c         d
    10  0.971125  0.106790  0.274001  0.960579
    11  0.722224  0.575325  0.465267  0.258976
    12  0.574039  0.258625  0.469209  0.886768
    13  0.915423  0.713076  0.073338  0.622967
    

    I hope that helps.

    EDIT

    In this case, I used this function with a pool of processes in (approximately) this manner:

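    from multiprocessing import Pool
    
    nprocs = 4
    
    def myfunction(row):
        # Placeholder for the real per-row work.
        return sum(row)
    
    if __name__ == "__main__":
        with Pool(nprocs) as pool:
            for chunk in chunker(df, nprocs):
                # Iterating a DataFrame yields column labels, so pass the rows
                # explicitly as plain tuples (these pickle cleanly).
                data = pool.map(myfunction, chunk.itertuples(index=False, name=None))
                # ... do more stuff with data here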
    

    I assume this should be very similar to using the IPython distributed machinery, but I haven't tried it.
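
For the groupby form the question asks about, one possible sketch is to build a key from each row's position with integer division, so that consecutive rows fall into the same group. This is an illustration rather than the method from the thread above: `chunk_size`, `chunk_ids` and the use of `mean()` are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Illustrative data: 14 rows, matching the example above.
df = pd.DataFrame(np.random.rand(14, 4), columns=["a", "b", "c", "d"])

chunk_size = 5  # assumed chunk length, chosen for illustration

# Row position // chunk_size gives 0,0,0,0,0,1,1,... so that
# consecutive rows share a key regardless of the index labels.
chunk_ids = np.arange(len(df)) // chunk_size

# Aggregate each consecutive chunk; mean() stands in for the real operation.
chunk_means = df.groupby(chunk_ids).mean()

# The same key can also be iterated to get each chunk as a DataFrame.
chunk_lengths = [len(chunk) for _, chunk in df.groupby(chunk_ids)]
```

Because the key is built from positions rather than index values, this should also work when the index is not a clean 0..n-1 range. All chunks have `chunk_size` rows except possibly the last.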
