Question
I just came across this issue when adding a multi-index to my pandas dataframe. I am using the pandas HDFStore with the option format='table', which I prefer because the saved dataframe is easier to understand and load when not using pandas. (For details see this SO answer: "Save pandas DataFrame using h5py for interoperabilty with other hdf5 readers".)
But I ran into a problem because I was setting the multi-index using drop=False when calling set_index, which keeps the index columns as dataframe columns. This was fine until I put the dataframe into the store using format='table'. Using format='fixed' worked fine, but format='table' raised an error about duplicate column names. I avoided the error by dropping the redundant columns before calling put and restoring them after calling get.
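To make the overlap concrete, here is a minimal sketch of how drop=False leaves the same names in both the index and the columns. The column names and values are made up for illustration, and the sketch only shows the overlap; it does not touch the HDFStore itself.

```python
import pandas as pd

# Hypothetical example data; these column names are not from my real dataframe.
df = pd.DataFrame({
    "site": ["a", "a", "b"],
    "year": [2016, 2017, 2016],
    "value": [1.0, 2.0, 3.0],
})

# drop=False keeps "site" and "year" as ordinary columns
# in addition to making them levels of the MultiIndex.
df = df.set_index(["site", "year"], drop=False)

# Names that now appear both as index levels and as columns.
overlap = [name for name in df.index.names if name in df.columns]
print(overlap)
```

These overlapping names are presumably what format='table' complains about.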
Here is the write/read pair of functions that I now use:
import pandas as pd


def write_df_without_index_columns(store, name, df):
    if isinstance(df.index, pd.MultiIndex):
        # drop any columns that are duplicates of index columns
        redundant_columns = set(df.index.names).intersection(set(df.columns))
        if redundant_columns:
            # work on a copy so the caller's dataframe is left unchanged
            df = df.copy(deep=True)
            df.drop(list(redundant_columns), axis=1, inplace=True)
    store.put(name, df,
              format='table',
              data_columns=True)


def read_df_add_index_columns(store, name, default_value):
    # note: default_value is not used in this snippet
    df = store.get(name)
    if isinstance(df.index, pd.MultiIndex):
        # remember the MultiIndex column names
        index_columns = df.index.names
        # put the MultiIndex columns into the data frame
        df.reset_index(drop=False, inplace=True)
        # now put the MultiIndex columns back into the index
        df.set_index(index_columns, drop=False, inplace=True)
    return df
My question: is there a better way to do this? I expect to have a data frame with millions of rows, so I do not want this to be too inefficient.
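One variant I can think of, shown here only as a sketch that has not been benchmarked on millions of rows: drop the redundant columns without an explicit deep copy, and restore them from the index levels instead of going through reset_index and set_index. The helper names drop_index_columns and restore_index_columns are hypothetical, and the sketch assumes every index level has a string name.

```python
import pandas as pd


def drop_index_columns(df):
    """Return df without columns that duplicate MultiIndex level names."""
    if isinstance(df.index, pd.MultiIndex):
        redundant = [name for name in df.index.names if name in df.columns]
        if redundant:
            # drop() without inplace returns a new frame,
            # so the caller's dataframe is not modified.
            return df.drop(columns=redundant)
    return df


def restore_index_columns(df):
    """Return a new frame with each MultiIndex level also present as a column."""
    if isinstance(df.index, pd.MultiIndex):
        # Assumes every level name is a string (required for keyword arguments).
        levels = {name: df.index.get_level_values(name) for name in df.index.names}
        # assign() appends the columns at the end and leaves the index as it is.
        return df.assign(**levels)
    return df


# Hypothetical example data for a round trip without any store involved.
original = pd.DataFrame({
    "site": ["a", "a", "b"],
    "year": [2016, 2017, 2016],
    "value": [1.0, 2.0, 3.0],
}).set_index(["site", "year"], drop=False)

slim = drop_index_columns(original)      # what would be passed to store.put
restored = restore_index_columns(slim)   # what would be applied after store.get
```

Unlike the reset_index/set_index pair, this puts the restored columns after the data columns rather than before them, so it only fits if column order does not matter.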
Source: https://stackoverflow.com/questions/44121688/store-multi-index-pandas-dataframe-with-hdf5-table-format