Pandas “Group By” Query on Large Data in HDFStore?

前端 未结 1 874
感动是毒
感动是毒 2020-12-01 04:55

I have about 7 million rows in an HDFStore with more than 60 columns. The data is more than I can fit into memory. I\'m looking to aggregate the data into group

相关标签:
1条回答
  • 2020-12-01 05:26

    Heres a complete example.

    import numpy as np
    import pandas as pd
    import os
    
    fname = 'groupby.h5'
    
    # create a frame
    df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo',
                             'bar', 'bar', 'bar', 'bar',
                             'foo', 'foo', 'foo'],
                       'B': ['one', 'one', 'one', 'two',
                             'one', 'one', 'one', 'two',
                             'two', 'two', 'one'],
                       'C': ['dull', 'dull', 'shiny', 'dull',
                             'dull', 'shiny', 'shiny', 'dull',
                             'shiny', 'shiny', 'shiny'],
                       'D': np.random.randn(11),
                       'E': np.random.randn(11),
                       'F': np.random.randn(11)})
    
    
    # create the store and append, using data_columns where I possibily
    # could aggregate
    with pd.get_store(fname) as store:
        store.append('df',df,data_columns=['A','B','C'])
        print "store:\n%s" % store
    
        print "\ndf:\n%s" % store['df']
    
        # get the groups
        groups = store.select_column('df','A').unique()
        print "\ngroups:%s" % groups
    
        # iterate over the groups and apply my operations
        l = []
        for g in groups:
    
            grp = store.select('df',where = [ 'A=%s' % g ])
    
            # this is a regular frame, aggregate however you would like
            l.append(grp[['D','E','F']].sum())
    
    
        print "\nresult:\n%s" % pd.concat(l, keys = groups)
    
    os.remove(fname)
    

    Output

    store:
    <class 'pandas.io.pytables.HDFStore'>
    File path: groupby.h5
    /df            frame_table  (typ->appendable,nrows->11,ncols->6,indexers->[index],dc->[A,B,C])
    
    df:
          A    B      C         D         E         F
    0   foo  one   dull -0.815212 -1.195488 -1.346980
    1   foo  one   dull -1.111686 -1.814385 -0.974327
    2   foo  one  shiny -1.069152 -1.926265  0.360318
    3   foo  two   dull -0.472180  0.698369 -1.007010
    4   bar  one   dull  1.329867  0.709621  1.877898
    5   bar  one  shiny -0.962906  0.489594 -0.663068
    6   bar  one  shiny -0.657922 -0.377705  0.065790
    7   bar  two   dull -0.172245  1.694245  1.374189
    8   foo  two  shiny -0.780877 -2.334895 -2.747404
    9   foo  two  shiny -0.257413  0.577804 -0.159316
    10  foo  one  shiny  0.737597  1.979373 -0.236070
    
    groups:Index([bar, foo], dtype=object)
    
    result:
    bar  D   -0.463206
         E    2.515754
         F    2.654810
    foo  D   -3.768923
         E   -4.015488
         F   -6.110789
    dtype: float64
    

    Some caveats:

    1) This methodology makes sense if your group density is relatively low. On the order of hundreds or thousands of groups. If you get more than that there are more efficient (but more complicated methods), and your function which you are applying (in this case sum) become more restrictive.

    Essentially you would iterator over the entire store by chunks, grouping as you go, but keeping the groups only semi-collapsed (imagine doing a mean, so you would need to keep a running total plus a running count, then divide at the end). So some operations would be a bit trickier, but could potentially handle MANY groups (and is really fast).

    2) the efficiency of this could be improved by saving the coordinates (e.g. the group locations, but this is a bit more complicated)

    3) multi-grouping is not possible with this scheme (it IS possible, but requires an approach more like 2) above

    4) the columns that you want to group, MUST be a data_column!

    5) you can combine any other filter you wish in the select btw (which is a sneeky way of doing multi-grouping btw, you just form 2 unique lists of group and iterator over the product of them, not extremely efficient if you have lots of groups, but can work)

    HTH

    let me know if this works for you

    0 讨论(0)
提交回复
热议问题