Question
I have a very large pandas DataFrame stored in an HDF5 file, and I need to rename its columns.
The straightforward way is to read the DataFrame in chunks using HDFStore.select, rename the columns, and store the chunks in another HDF5 file.
But this seems wasteful and inefficient. Is there a way to rename the columns directly in the HDF5 file?
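For reference, the chunked copy described above might look roughly like the sketch below. The function name, its parameters, and the default chunk size are my own choices for illustration, and it assumes the source was written in table format (chunked select does not work on fixed-format stores). Settings such as data_columns or indexes on the source are not carried over.

```python
def copy_with_renamed_columns(src_path, dst_path, key, mapping, chunksize=100_000):
    """Copy the table at `key` to a new HDF5 file, renaming columns per `mapping`.

    Hypothetical helper: reads the source in chunks so the whole
    DataFrame never has to fit in memory at once.
    """
    import pandas as pd  # needs PyTables installed for HDF5 support

    with pd.HDFStore(src_path, mode="r") as src, pd.HDFStore(dst_path, mode="w") as dst:
        # select(..., chunksize=n) yields DataFrames of at most n rows each
        for chunk in src.select(key, chunksize=chunksize):
            # append() writes in table format and grows the destination table
            dst.append(key, chunk.rename(columns=mapping))
```

Calling `copy_with_renamed_columns('old.h5', 'new.h5', 'df', {'a': 'new'})` would rewrite every row, which is exactly the cost I would like to avoid.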
Answer 1:
It can be done by changing the metadata. BIG WARNING: this may corrupt your file, so you do this at your own risk.
Create a store. It must be in table format. I didn't use data_columns here, but only a slight change is needed to rename those.
In [1]: import numpy as np; import pandas as pd
In [2]: df = pd.DataFrame(np.random.randn(10,3),columns=list('abc'))
In [3]: df.to_hdf('test.h5',key='df',format='table')
In [25]: pd.read_hdf('test.h5','df')
Out[25]:
a b c
0 1.366298 0.844646 -0.470735
1 -1.438387 -1.288432 0.250763
2 -1.290225 -0.390315 -0.138440
3 2.343019 0.632340 -0.539334
4 -1.184943 0.566479 1.977939
5 -1.530772 0.757110 -0.013930
6 -0.300345 -0.951563 -1.013957
7 -0.073975 -0.256521 1.024525
8 -0.179189 -1.767918 0.591720
9 0.641028 0.205522 1.947618
Get a handle to the store that holds the table:
In [26]: store = pd.HDFStore('test.h5')
You need to change the metadata in two places. First, at the top level:
In [28]: store.get_storer('df').attrs['non_index_axes']
Out[28]: [(1, ['a', 'b', 'c'])]
In [29]: store.get_storer('df').attrs.non_index_axes = [(1, ['new','b','c'])]
Then in the attributes of the table itself:
In [31]: store.get_storer('df').table.attrs
Out[31]:
/df/table._v_attrs (AttributeSet), 12 attributes:
[CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := 0.0,
FIELD_1_NAME := 'values_block_0',
NROWS := 10,
TITLE := '',
VERSION := '2.7',
index_kind := 'integer',
values_block_0_dtype := 'float64',
values_block_0_kind := ['a', 'b', 'c'],
values_block_0_meta := None]
In [33]: store.get_storer('df').table.attrs.values_block_0_kind = ['new','b','c']
Close the store to save the changes:
In [34]: store.close()
In [35]: pd.read_hdf('test.h5','df')
Out[35]:
new b c
0 1.366298 0.844646 -0.470735
1 -1.438387 -1.288432 0.250763
2 -1.290225 -0.390315 -0.138440
3 2.343019 0.632340 -0.539334
4 -1.184943 0.566479 1.977939
5 -1.530772 0.757110 -0.013930
6 -0.300345 -0.951563 -1.013957
7 -0.073975 -0.256521 1.024525
8 -0.179189 -1.767918 0.591720
9 0.641028 0.205522 1.947618
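The two metadata edits above could be wrapped into one function, roughly as sketched here. The names `rename_labels` and `rename_hdf_columns_in_place` are made up for this example. It assumes the same layout as the session above: table format, no data_columns, and all columns in a single block named values_block_0 (a frame with mixed dtypes has more blocks, which this sketch does not touch). The same warning applies: work on a backup, since this can corrupt the file.

```python
def rename_labels(labels, mapping):
    """Return a new list of labels with names replaced per `mapping`."""
    return [mapping.get(name, name) for name in labels]


def rename_hdf_columns_in_place(path, key, mapping):
    """Rename columns of a table-format frame by editing its metadata only.

    Hypothetical helper; assumes a single values block and no data_columns.
    """
    import pandas as pd  # needs PyTables installed for HDF5 support

    with pd.HDFStore(path, mode="a") as store:
        storer = store.get_storer(key)

        # Place 1: the top-level attribute, e.g. [(1, ['a', 'b', 'c'])]
        storer.attrs.non_index_axes = [
            (axis, rename_labels(labels, mapping))
            for axis, labels in storer.attrs.non_index_axes
        ]

        # Place 2: the column names recorded for the values block on the table
        table_attrs = storer.table.attrs
        table_attrs.values_block_0_kind = rename_labels(
            table_attrs.values_block_0_kind, mapping
        )
    # Leaving the with-block closes the store, which saves the changes
```

With the file from the session, `rename_hdf_columns_in_place('test.h5', 'df', {'a': 'new'})` should have the same effect as the manual steps.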
Source: https://stackoverflow.com/questions/32079874/is-it-possible-to-directly-rename-pandas-dataframes-columns-stored-in-hdf5-file