Selecting columns from pandas.HDFStore table

遥遥无期 2020-12-13 00:52

How can I retrieve specific columns from a pandas HDFStore? I regularly work with very large data sets that are too big to manipulate in memory. I would like to read in a

3 Answers
  •  没有蜡笔的小新
    2020-12-13 01:35

    You can store the dataframe with an index of the columns as follows:

    import pandas as pd
    import numpy as np
    from pandas.io.pytables import Term
    
    # Small example frame: 8 daily rows, columns A, B and C
    index = pd.date_range('1/1/2000', periods=8)
    df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=list('ABC'))
    
    # Store it as a table indexed along the columns axis
    store = pd.HDFStore('mydata.h5')
    store.append('df_cols', df, axes='columns')
    

    and then select as you might hope:

    In [8]: store.select('df_cols', [Term('columns', '=', 'A')])
    Out[8]: 
    2000-01-01    0.347644
    2000-01-02    0.477167
    2000-01-03    1.419741
    2000-01-04    0.641400
    2000-01-05   -1.313405
    2000-01-06   -0.137357
    2000-01-07   -1.208429
    2000-01-08   -0.539854
    

    Where:

    In [9]: df
    Out[9]: 
                       A         B         C
    2000-01-01  0.347644  0.895084 -1.457772
    2000-01-02  0.477167  0.464013 -1.974695
    2000-01-03  1.419741  0.470735 -0.309796
    2000-01-04  0.641400  0.838864 -0.112582
    2000-01-05 -1.313405 -0.678250 -0.306318
    2000-01-06 -0.137357 -0.723145  0.982987
    2000-01-07 -1.208429 -0.672240  1.331291
    2000-01-08 -0.539854 -0.184864 -1.056217
    


    To me this isn't an ideal solution, as we can only index the DataFrame by one thing! Worryingly, the docs seem to suggest you can only index a DataFrame by one thing, at least using axes:

    Pass the axes keyword with a list of dimension (currently must by exactly 1 less than the total dimensions of the object).

    I may be reading this incorrectly, in which case hopefully someone can prove me wrong!


    Note: One way I have found to index a DataFrame by two things (index and columns) is to convert it to a Panel, from which we can then retrieve using two indices. However, we then have to convert the selected subpanel back to a DataFrame each time items are retrieved... again, not ideal.
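For comparison, here is a minimal sketch of another route to the column subset the question asks about, which is not the approach used in the answer above. It assumes a store written in table format with the default axes, a pandas version whose `HDFStore.select` accepts the `columns` and `chunksize` keywords, and PyTables installed. The function name `select_columns_demo`, the key `'df'` and the temporary file are made up for illustration. As far as I can tell, `columns` filters the result after the rows are read, so on a very large table it is probably worth pairing it with `chunksize` as shown.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def select_columns_demo():
    """Write a small table-format store, then read back only column 'A'."""
    index = pd.date_range('1/1/2000', periods=8)
    df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=list('ABC'))

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'mydata.h5')  # hypothetical file name
        with pd.HDFStore(path) as store:
            # append() writes the frame in the queryable "table" format
            store.append('df', df)

            # columns= limits the returned DataFrame to the listed columns
            subset = store.select('df', columns=['A'])

            # With chunksize, select() yields the same subset piece by piece,
            # so the whole result never has to sit in memory at once
            chunks = list(store.select('df', columns=['A'], chunksize=4))

    return df, subset, chunks
```

Calling `select_columns_demo()` returns the full frame, the single-column subset, and the list of row chunks, so the three can be compared side by side.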
