Unable to save DataFrame to HDF5 (“object header message is too large”)

遇见更好的自我 2020-12-10 05:05

I have a DataFrame in Pandas:

In [7]: my_df
Out[7]: 

Int64Index: 34 entries, 0 to 0
Columns: 2661 entries, airp

4 Answers
  •  悲哀的现实
    2020-12-10 05:45

    Although this thread is more than 5 years old, the problem is still relevant: it is still not possible to save a DataFrame with more than roughly 2000 columns as a single table in an HDFStore. Using format='fixed' is not an option if you want to choose which columns to read from the HDFStore later.

    Here is a function that splits the DataFrame into smaller ones and stores them as separate tables. Additionally, a pandas.Series is written to the HDFStore that records which table each column belongs to.

    def wideDf_to_hdf(filename, data, columns=None, maxColSize=2000, **kwargs):
        """Write a `pandas.DataFrame` with a large number of columns
        to one HDFStore.
    
        Parameters
        ----------
        filename : str
            name of the HDFStore
        data : pandas.DataFrame
            data to save in the HDFStore
        columns: list
            a list of columns for storing. If set to `None`, all 
            columns are saved.
        maxColSize : int (default=2000)
            this number defines the maximum possible column size of 
            a table in the HDFStore.
    
        """
        import numpy as np
        import pandas as pd
        from collections import ChainMap
        store = pd.HDFStore(filename, **kwargs)
        if columns is None:
            columns = data.columns
        colSize = columns.shape[0]
        if colSize > maxColSize:
            numOfSplits = np.ceil(colSize / maxColSize).astype(int)
            colsSplit = [
                columns[i * maxColSize:(i + 1) * maxColSize]
                for i in range(numOfSplits)
            ]
            # Map every column name to the table it is stored in.
            _colsTabNum = ChainMap(*[
                dict(zip(cols, ['data{}'.format(num)] * len(cols)))
                for num, cols in enumerate(colsSplit)
            ])
            colsTabNum = pd.Series(dict(_colsTabNum)).sort_index()
            for num, cols in enumerate(colsSplit):
                store.put('data{}'.format(num), data[cols], format='table')
            store.put('colsTabNum', colsTabNum, format='fixed')
        else:
            store.put('data', data[columns], format='table')
        store.close()
    
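    For a frame shaped like the one in the question (2661 columns with the default maxColSize=2000), the splitting step above produces two chunks. A minimal, self-contained sketch of just that slicing logic (the column names here are made up):

    ```python
    import numpy as np
    import pandas as pd

    # Hypothetical column index the size of the one in the question.
    columns = pd.Index(['col{}'.format(i) for i in range(2661)])
    maxColSize = 2000

    # ceil(2661 / 2000) == 2 tables are needed.
    numOfSplits = int(np.ceil(columns.shape[0] / maxColSize))
    colsSplit = [columns[i * maxColSize:(i + 1) * maxColSize]
                 for i in range(numOfSplits)]

    print([len(c) for c in colsSplit])  # → [2000, 661]
    ```

    The frame would end up in two tables, `data0` with 2000 columns and `data1` with the remaining 661.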

    DataFrames stored in an HDFStore with the function above can be read back with the following function.

    def read_hdf_wideDf(filename, columns=None, **kwargs):
        """Read a `pandas.DataFrame` from a HDFStore.
    
        Parameters
        ----------
        filename : str
            name of the HDFStore
        columns : list
            the columns in this list are loaded. Load all columns, 
            if set to `None`.
    
        Returns
        -------
        data : pandas.DataFrame
            loaded data.
    
        """
        import pandas as pd
        store = pd.HDFStore(filename)
        data = []
        # `store.select` raises a KeyError for a missing key, so test first.
        if 'colsTabNum' in store:
            colsTabNum = store.select('colsTabNum')
            if columns is not None:
                # Invert the column -> table mapping into table -> columns.
                tabNums = pd.Series(
                    index=colsTabNum[columns].values,
                    data=colsTabNum[columns].index).sort_index()
                for table in tabNums.index.unique():
                    cols = tabNums.loc[[table]].tolist()
                    data.append(store.select(table, columns=cols, **kwargs))
            else:
                for table in colsTabNum.unique():
                    data.append(store.select(table, **kwargs))
            data = pd.concat(data, axis=1).sort_index(axis=1)
        else:
            data = store.select('data', columns=columns)
        store.close()
        return data
    
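    The key step when reading a subset of columns is inverting the stored colsTabNum Series (column → table name) into table → requested columns before calling `store.select`. A self-contained sketch of that inversion with a toy mapping (the names are illustrative, not what any real store contains):

    ```python
    import pandas as pd

    # Toy mapping as it would be stored under the 'colsTabNum' key.
    colsTabNum = pd.Series({'a': 'data0', 'b': 'data0', 'c': 'data1'})

    columns = ['a', 'c']  # columns the caller asked for

    # Invert: index = table names, values = requested column names.
    tabNums = pd.Series(index=colsTabNum[columns].values,
                        data=colsTabNum[columns].index)

    for table in tabNums.index.unique():
        # .loc[[table]] always returns a Series, even for a single column.
        cols = tabNums.loc[[table]].tolist()
        print(table, cols)
    ```

    Each table is then selected once with only the columns it holds, and the partial frames are concatenated along axis=1.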
