How does one append large amounts of data to a Pandas HDFStore and get a natural unique index?

长发绾君心 2020-12-08 17:16

I'm importing large amounts of HTTP logs (80GB+) into a Pandas HDFStore for statistical processing. Even within a single import file I need to batch the content as I load i

1 answer
  • 2020-12-08 17:38

    You can do it like this. The only trick is that on the first pass the table doesn't exist in the store yet, so the get_storer('foo').nrows lookup raises and the row count has to default to 0.

    import pandas as pd
    import numpy as np
    import os
    
    files = ['test1.csv','test2.csv']
    for f in files:
        pd.DataFrame(np.random.randn(10,2),columns=list('AB')).to_csv(f)
    
    path = 'test.h5'
    if os.path.exists(path):
        os.remove(path)
    
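    # pd.get_store was removed in pandas 1.0; open the store with pd.HDFStore instead
    with pd.HDFStore(path) as store:
        for f in files:
            df = pd.read_csv(f,index_col=0)
            try:
                # number of rows already stored under 'foo'
                nrows = store.get_storer('foo').nrows
            except (KeyError, AttributeError):
                # first pass: the table does not exist yet
                nrows = 0
    
            # shift this batch's index so it continues after the stored rows
            df.index = df.index + nrows
            store.append('foo',df)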
    
    
    In [10]: pd.read_hdf('test.h5','foo')
    Out[10]: 
               A         B
    0   0.772017  0.153381
    1   0.304131  0.368573
    2   0.995465  0.799655
    3  -0.326959  0.923280
    4  -0.808376  0.449645
    5  -1.336166  0.236968
    6  -0.593523 -0.359080
    7  -0.098482  0.037183
    8   0.315627 -1.027162
    9  -1.084545 -1.922288
    10  0.412407 -0.270916
    11  1.835381 -0.737411
    12 -0.607571  0.507790
    13  0.043509 -0.294086
    14 -0.465210  0.880798
    15  1.181344  0.354411
    16  0.501892 -0.358361
    17  0.633256  0.419397
    18  0.932354 -0.603932
    19 -0.341135  2.453220
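
    The question also mentions batching within a single import file. A minimal sketch of that, assuming PyTables is installed: read each CSV with chunksize and keep a running row offset, so every chunk gets an index that continues the stored row count. The function name append_csv_in_chunks, the key 'foo' and the tiny chunk size are illustrative choices, not part of the original answer.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def append_csv_in_chunks(csv_path, h5_path, key="foo", chunksize=4):
    """Append csv_path to an HDFStore table in batches with a running integer index.

    Hypothetical helper for illustration; returns the total number of stored rows.
    """
    with pd.HDFStore(h5_path) as store:
        # rows already stored under `key` (0 if the table does not exist yet)
        offset = store.get_storer(key).nrows if key in store else 0
        for chunk in pd.read_csv(csv_path, index_col=0, chunksize=chunksize):
            # replace the per-file index with one that continues the stored row count
            chunk.index = pd.RangeIndex(offset, offset + len(chunk))
            store.append(key, chunk)
            offset += len(chunk)
    return offset


# Demo with two small random files, mirroring the example above.
workdir = tempfile.mkdtemp()
h5_path = os.path.join(workdir, "chunked.h5")
for name in ("part1.csv", "part2.csv"):
    csv_path = os.path.join(workdir, name)
    pd.DataFrame(np.random.randn(10, 2), columns=list("AB")).to_csv(csv_path)
    total = append_csv_in_chunks(csv_path, h5_path)

result = pd.read_hdf(h5_path, "foo")
```

    Tracking the offset in a local variable avoids asking the store for nrows on every chunk; for real log files a much larger chunksize would be the sensible choice.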
    

    You don't necessarily need a globally unique index (unless you want one), since HDFStore (through PyTables) already numbers the rows of a table uniquely. You can always select by row number with the start and stop parameters.

    In [11]: pd.read_hdf('test.h5','foo',start=12,stop=15)
    Out[11]: 
               A         B
    12 -0.607571  0.507790
    13  0.043509 -0.294086
    14 -0.465210  0.880798
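
    The difference between selecting by row number and selecting by index value can be sketched as follows, again assuming PyTables is installed. The file name and the deterministic sample data are made up for illustration; select, start, stop, where and chunksize are the HDFStore parameters being demonstrated.

```python
import os
import tempfile

import numpy as np
import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "rows.h5")
df = pd.DataFrame(np.arange(40, dtype="float64").reshape(20, 2), columns=list("AB"))

with pd.HDFStore(path) as store:
    store.append("foo", df)
    # select by physical row number, independent of the index values
    by_row = store.select("foo", start=12, stop=15)
    # select by index value; the index is stored alongside the table
    by_index = store.select("foo", where="(index >= 12) & (index < 15)")
    # walk the table in fixed-size pieces instead of loading it all at once
    chunk_lengths = [len(chunk) for chunk in store.select("foo", chunksize=8)]
```

    Here both selections return the same three rows only because the index happens to match the row numbers; with a non-unique or non-sequential index, start and stop still address rows unambiguously.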
    