Read an HDF5 file into a pandas DataFrame with conditions

逝去的感伤 2020-12-16 18:09

I have a huge HDF5 file and want to load part of it into a pandas DataFrame to perform some operations, but I only need the rows that match certain conditions.

I can explain better

2 Answers
  • 2020-12-16 18:53

    You can do this with pandas.read_hdf and its optional where parameter.
    For example: pd.read_hdf('store_tl.h5', 'table', where=['index>2'])
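
To make the one-liner above concrete, here is a minimal sketch that writes a small table-format file and reads back only the rows whose index is greater than 2. The temporary path, the key 'data' and the column name 'x' are assumptions made up for illustration, and the PyTables package (tables) must be installed for pandas to read or write HDF5.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data: six rows with the default integer index 0..5.
df = pd.DataFrame({'x': [10, 11, 12, 13, 14, 15]})

# Write to a temporary file in table format so it can be queried on read.
path = os.path.join(tempfile.mkdtemp(), 'store_tl.h5')
df.to_hdf(path, key='data', mode='w', format='table')

# Only rows whose index satisfies the condition are loaded into memory.
subset = pd.read_hdf(path, key='data', where='index > 2')
print(subset)
```

The index of a table-format store can be queried without any extra declaration; filtering on ordinary columns needs more setup when the file is written.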

  • 2020-12-16 18:57

    The HDF5 file must be written in table format (as opposed to fixed format) in order to be queryable with the where argument of pd.read_hdf.

    Furthermore, the column you want to filter on (A in this example) must be declared as a data column when the file is written:

    df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=['A'],
              format='table')
    

    or, to specify all columns as (queryable) data columns:

    df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=True,
              format='table')
    

    Then you could use

    pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]')
    

    to select the rows where the value of column A is 1, 3 or 4. For example,

    import pandas as pd
    
    df = pd.DataFrame({
        'A': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
        'B': [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
        'C': [34, 32, 35, 34, 31, 34, 29, 34, 12, 34, 32, 34],
        'D': [11, 15, 22, 15, 9, 15, 11, 15, 14, 15, 13, 15]})
    
    df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=['A'],
              format='table')
    
    print(pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]'))
    

    yields

        A  B   C   D
    0   1  0  34  11
    2   3  1  35  22
    3   4  1  34  15
    5   1  0  34  15
    7   3  0  34  15
    8   4  1  12  14
    10  1  0  32  13
    
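
The two requirements described in this answer (table format, and a declared data column) can be checked with a small experiment. The sketch below uses a hypothetical four-row frame and temporary file names; it tries a where query against a fixed-format file, then against a table-format file on a column (B) that was not declared in data_columns. Both attempts are expected to raise an error. The exact exception types may vary between pandas versions, so the sketch catches both TypeError and ValueError.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data, smaller than the frame used above.
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [0, 0, 1, 1]})
tmpdir = tempfile.mkdtemp()

# Case 1: fixed format cannot be filtered with where at all.
fixed_path = os.path.join(tmpdir, 'fixed.h5')
df.to_hdf(fixed_path, key='results_table', mode='w', format='fixed')
try:
    pd.read_hdf(fixed_path, key='results_table', where='A in [1,3,4]')
    fixed_error = None
except (TypeError, ValueError) as exc:
    fixed_error = exc

# Case 2: table format, but only A is declared as a data column,
# so a condition on B is rejected.
table_path = os.path.join(tmpdir, 'table.h5')
df.to_hdf(table_path, key='results_table', mode='w', data_columns=['A'],
          format='table')
try:
    pd.read_hdf(table_path, key='results_table', where='B == 1')
    undeclared_error = None
except (TypeError, ValueError) as exc:
    undeclared_error = exc

print(type(fixed_error).__name__, type(undeclared_error).__name__)
```

If either variable ends up as None, the query was accepted, which would mean the installed pandas version behaves differently from what this answer describes.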

    If you have a very long list of values, vals, then you could use string formatting to compose the right where argument:

    where='A in {}'.format(vals)
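
As a minimal sketch of the string-formatting approach, the snippet below builds the where argument from a Python list and runs the query against the same data as the example above. The temporary path is an assumption for illustration. It also assumes vals is a plain list of Python ints, so that its string form is a valid list literal.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
    'B': [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1]})

path = os.path.join(tempfile.mkdtemp(), 'out.h5')
df.to_hdf(path, key='results_table', mode='w', data_columns=['A'],
          format='table')

# A plain Python list formats as '[1, 3, 4]', which the query parser accepts.
vals = [1, 3, 4]
where = 'A in {}'.format(vals)

result = pd.read_hdf(path, key='results_table', where=where)
print(where)
print(result)
```

If vals is a NumPy array, its string form has no commas ('[1 3 4]'), so converting it first with vals.tolist() is probably the safer route.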
    