Unable to save DataFrame to HDF5 (“object header message is too large”)

遇见更好的自我 2020-12-10 05:05

I have a DataFrame in Pandas:

In [7]: my_df
Out[7]: 

Int64Index: 34 entries, 0 to 0
Columns: 2661 entries, airp         


        
4 Answers
  • 2020-12-10 05:31

    HDF5 has a 64 KB header limit for all of the column metadata. This includes names, types, and so on. Once you go above roughly 2000 columns, you run out of space to store all the metadata. This is a fundamental limitation of PyTables, and I don't think they will add a workaround on their side any time soon. You will either have to split the table up or choose another storage format.
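
    A minimal sketch of the failure and of one alternative format (the column names below are synthetic, and to_parquet needs pyarrow or fastparquet installed):

    import numpy as np
    import pandas as pd

    # A frame this wide overflows the 64 KB object header when written
    # with format='table'.
    wide = pd.DataFrame(np.random.rand(34, 2661),
                        columns=['col{}'.format(i) for i in range(2661)])

    try:
        wide.to_hdf('wide.h5', key='my_df', format='table')
    except Exception as exc:
        print(exc)   # "... object header message is too large ..."

    # One alternative storage format without this per-table column limit:
    wide.to_parquet('wide.parquet')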

  • 2020-12-10 05:35
    ### USE get_weights AND set_weights TO SAVE AND LOAD THE MODEL, RESPECTIVELY.

    import pickle
    from keras.models import Sequential
    from keras.layers import Conv2D, Activation, Flatten, Dense

    # Assuming that this is your model architecture. However, you may use
    # whatever architecture you want (big or small; any).
    def mymodel():
        inputShape = (28, 28, 3)
        model = Sequential()
        model.add(Conv2D(20, 5, padding="same", input_shape=inputShape))
        model.add(Activation('relu'))
        model.add(Flatten())
        model.add(Dense(500))
        model.add(Activation('relu'))
        model.add(Dense(2, activation="softmax"))
        return model

    model = mymodel()
    model.fit(....)    # parameters to start training your model

    ################################################################################
    # Once your model has been trained, save its weights to disk.
    # Use get_weights() to get the model weights.
    weigh = model.get_weights()

    # Now use pickle to save the model weights instead of .h5;
    # for very wide/heavy architectures, writing to .h5 can hit this limit.
    pklfile = "D:/modelweights.pkl"
    with open(pklfile, 'wb') as fpkl:    # binary mode works on Python 2 and 3
        pickle.dump(weigh, fpkl, protocol=pickle.HIGHEST_PROTOCOL)

    ################################################################################
    # In future, when you want your model back, use pickle to load the weights.
    pklfile = "D:/modelweights.pkl"
    with open(pklfile, 'rb') as f:
        weigh = pickle.load(f)

    restoredmodel = mymodel()
    # Use set_weights to load the weights into the model architecture.
    restoredmodel.set_weights(weigh)

    ################################################################################
    # Now you can do your testing and evaluation - predictions.
    y_pred = restoredmodel.predict(X)    # X: your test data
  • 2020-12-10 05:39

    As of 2014, the HDF documentation has been updated:

    If you are using HDF5 1.8.0 or previous releases, there is a limit on the number
    of fields you can have in a compound datatype. This is due to the 64K limit on
    object header messages, into which datatypes are encoded. (However, you can
    create a lot of fields before it will fail. One user was able to create up to
    1260 fields in a compound datatype before it failed.)
    

    As for pandas, it can save a DataFrame with an arbitrary number of columns using the format='fixed' option; format='table' still raises the same error as in the question. I've also tried h5py and got the 'too large header' error as well (even though I had a version > 1.8.0).
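
    For reference, a minimal sketch of the fixed-format route (assuming my_df is the wide frame from the question; the trade-off is that individual columns cannot be selected when reading a fixed store):

    import pandas as pd

    # format='fixed' stores the frame as plain arrays, so the 64 KB limit
    # on the column metadata of 'table' stores does not apply.
    my_df.to_hdf('data.h5', key='my_df', format='fixed')

    # A fixed store has to be read back as a whole; column selection on
    # read is only supported for 'table' stores.
    df = pd.read_hdf('data.h5', key='my_df')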

  • 2020-12-10 05:45

    Although this thread is more than 5 years old, the problem is still relevant. It's still not possible to save a DataFrame with more than 2000 columns as one table in an HDFStore. Using format='fixed' isn't an option if one wants to choose which columns to read from the HDFStore later.

    Here is a function that splits the DataFrame into smaller ones and stores them as separate tables. In addition, a pandas.Series is put into the HDFStore that records which table each column belongs to. A short usage sketch follows the two functions.

    def wideDf_to_hdf(filename, data, columns=None, maxColSize=2000, **kwargs):
        """Write a `pandas.DataFrame` with a large number of columns
        to one HDFStore.
    
        Parameters
        -----------
        filename : str
            name of the HDFStore
        data : pandas.DataFrame
            data to save in the HDFStore
        columns: list
            a list of columns for storing. If set to `None`, all 
            columns are saved.
        maxColSize : int (default=2000)
            this number defines the maximum possible column size of 
            a table in the HDFStore.
    
        """
        import numpy as np
        import pandas as pd
        from collections import ChainMap
        store = pd.HDFStore(filename, **kwargs)
        if columns is None:
            columns = data.columns
        colSize = columns.shape[0]
        if colSize > maxColSize:
            numOfSplits = np.ceil(colSize / maxColSize).astype(int)
            colsSplit = [
                columns[i * maxColSize:(i + 1) * maxColSize]
                for i in range(numOfSplits)
            ]
            _colsTabNum = ChainMap(*[
                dict(zip(columns, ['data{}'.format(num)] * colSize))
                for num, columns in enumerate(colsSplit)
            ])
            colsTabNum = pd.Series(dict(_colsTabNum)).sort_index()
            for num, cols in enumerate(colsSplit):
                store.put('data{}'.format(num), data[cols], format='table')
            store.put('colsTabNum', colsTabNum, format='fixed')
        else:
            store.put('data', data[columns], format='table')
        store.close()
    

    DataFrames stored in an HDFStore with the function above can be read back with the following function.

    def read_hdf_wideDf(filename, columns=None, **kwargs):
        """Read a `pandas.DataFrame` from a HDFStore.
    
        Parameters
        ----------
        filename : str
            name of the HDFStore
        columns : list
            the columns in this list are loaded. Load all columns, 
            if set to `None`.
    
        Returns
        -------
        data : pandas.DataFrame
            loaded data.
    
        """
        store = pd.HDFStore(filename)
        data = []
        if 'colsTabNum' in store:
            # Series mapping each column name to the table that holds it.
            colsTabNum = store.select('colsTabNum')
            if columns is not None:
                # Invert the mapping: table name -> requested column names.
                tabNums = pd.Series(
                    index=colsTabNum[columns].values,
                    data=colsTabNum[columns].index).sort_index()
                for table in tabNums.index.unique():
                    data.append(
                        store.select(table,
                                     columns=list(tabNums.loc[[table]]),
                                     **kwargs))
            else:
                for table in colsTabNum.unique():
                    data.append(store.select(table, **kwargs))
            data = pd.concat(data, axis=1).sort_index(axis=1)
        else:
            data = store.select('data', columns=columns)
        store.close()
        return data
    
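    A short usage sketch of the two helpers above (the file name and column count are made up for illustration):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(10, 5000),
                      columns=['col{}'.format(i) for i in range(5000)])

    # Writes the frame as tables 'data0', 'data1', 'data2' plus the
    # column-to-table mapping 'colsTabNum'.
    wideDf_to_hdf('wide_store.h5', df, mode='w')

    # Read back only a subset of the columns.
    subset = read_hdf_wideDf('wide_store.h5', columns=['col0', 'col4999'])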