Read an HDF5 file into a pandas DataFrame with conditions


I have a huge HDF5 file and want to load only part of it into a pandas DataFrame to perform some operations. Specifically, I want to filter rows by a condition while reading.

I can explain better

2 Answers

野趣味

2020-12-16 18:53

You can do this using pandas.read_hdf with its optional where parameter.
For example: pd.read_hdf('store_tl.h5', 'table', where=['index>2'])
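As a minimal sketch of the call above, the following writes a small table-format file and reads back only the rows whose index is greater than 2. The column name value, the sample data, and the temporary path are assumptions made for illustration; the PyTables package (tables) must be installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data; the original file contents are not shown.
df = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

# Write to a temporary location in table format so it can be queried.
path = os.path.join(tempfile.mkdtemp(), "store_tl.h5")
df.to_hdf(path, key="table", mode="w", format="table")

# Only rows matching the condition are returned.
subset = pd.read_hdf(path, key="table", where="index > 2")
print(subset)
```

The filter is evaluated while reading, so rows that do not match are not returned to pandas.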


北恋

2020-12-16 18:57

The hdf5 file must be written in table format (as opposed to fixed format) in order to be queryable with pd.read_hdf's where argument.
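To illustrate the difference, here is a small sketch that writes the same frame in both formats and tries a where query against each. The file names and sample data are assumptions; the exception types caught are the ones pandas is expected to raise for a fixed-format store, listed broadly to be safe.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4]})

tmpdir = tempfile.mkdtemp()
fixed_path = os.path.join(tmpdir, "fixed.h5")  # hypothetical file names
table_path = os.path.join(tmpdir, "table.h5")

df.to_hdf(fixed_path, key="results", mode="w", format="fixed")
df.to_hdf(table_path, key="results", mode="w", format="table")

# A fixed-format store has to be read in its entirety, so the query fails.
try:
    pd.read_hdf(fixed_path, key="results", where="index > 1")
    fixed_queryable = True
except (TypeError, ValueError) as exc:
    fixed_queryable = False
    print("fixed format rejected the query:", exc)

# The table-format store accepts the same query.
table_rows = pd.read_hdf(table_path, key="results", where="index > 1")
print(table_rows)
```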

Furthermore, any column you want to filter on (A in the example below) must be declared as a data column:

df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=['A'],
          format='table')

or, to specify all columns as (queryable) data columns:

df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=True,
          format='table')

Then you could use

pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]')

to select rows where the value in column A is 1, 3 or 4. For example,

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
    'B': [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    'C': [34, 32, 35, 34, 31, 34, 29, 34, 12, 34, 32, 34],
    'D': [11, 15, 22, 15, 9, 15, 11, 15, 14, 15, 13, 15]})

# Write in table format and make column A queryable.
df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=['A'],
          format='table')

# Read back only the rows where A is 1, 3 or 4.
print(pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]'))

yields

    A  B   C   D
0   1  0  34  11
2   3  1  35  22
3   4  1  34  15
5   1  0  34  15
7   3  0  34  15
8   4  1  12  14
10  1  0  32  13
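When every column is written as a data column (data_columns=True, as shown earlier), a condition can combine several columns. The sketch below reuses the same sample frame; the particular condition, the columns argument used to load only A and C, and the temporary path are assumptions chosen for illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
    "B": [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    "C": [34, 32, 35, 34, 31, 34, 29, 34, 12, 34, 32, 34],
    "D": [11, 15, 22, 15, 9, 15, 11, 15, 14, 15, 13, 15]})

path = os.path.join(tempfile.mkdtemp(), "out.h5")

# data_columns=True makes every column usable in a where expression.
df.to_hdf(path, key="results_table", mode="w", data_columns=True,
          format="table")

# Combine conditions on two columns with &.
both = pd.read_hdf(path, key="results_table", where="A > 2 & B == 1")
print(both)

# Optionally restrict the columns that are loaded as well.
narrow = pd.read_hdf(path, key="results_table", where="A > 2 & B == 1",
                     columns=["A", "C"])
print(narrow)
```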

If you have a very long list of values, vals, then you could use string formatting to compose the right where argument:

where='A in {}'.format(vals)
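A short sketch of that string formatting in use, assuming vals is a plain Python list (a NumPy array would print without commas, so convert it with list(vals) first). The sample data and the temporary path are illustrative.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2]})
path = os.path.join(tempfile.mkdtemp(), "out.h5")
df.to_hdf(path, key="results_table", mode="w", data_columns=["A"],
          format="table")

vals = [1, 3, 4]  # could be a much longer list
where = "A in {}".format(vals)
print(where)  # A in [1, 3, 4]

result = pd.read_hdf(path, key="results_table", where=where)
print(result)
```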
