Storing pandas DataFrame with mixed data and category into hdf5

轮回少年 2020-12-31 11:06

I want to store a DataFrame with columns of different types in an HDF5 file (an excerpt with the data types is shown below).

In  [1]: mydf
Out [1]:
endTime             uin         


        
1 answer
  • 2020-12-31 11:32

    You have two problems:

    1. You want to store categorical data in an HDF5 file;
    2. You're trying to store arbitrary objects (i.e. stationList) in an HDF5 file.

    As you discovered, categorical data is (currently?) only supported in the "table" format for HDF5.

    However, storing arbitrary objects (lists of strings, etc.) is not something the HDF5 format itself supports. Pandas works around that for you by serializing these objects with pickle and then storing the pickle as an arbitrary-length string (which, I think, is not supported by all HDF5 formats). That will be slow and inefficient, and it will never be well supported by HDF5.
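To see which columns trigger the pickle fallback described above, one rough check is to look for object-dtype columns whose cells are Python containers. This is only a sketch: the helper name `columns_holding_containers` and the sample values are made up for illustration, and only the column names `endTime`, `uin` and `stationList` come from the question.

```python
import pandas as pd


def columns_holding_containers(df: pd.DataFrame) -> list:
    """Return names of object-dtype columns whose cells are lists, tuples, sets or dicts."""
    found = []
    for name in df.columns:
        col = df[name]
        if col.dtype != object:
            # Numeric, datetime and categorical columns never hold containers.
            continue
        if col.map(lambda v: isinstance(v, (list, tuple, set, dict))).any():
            found.append(name)
    return found


# Hypothetical frame loosely modelled on the question's columns.
example = pd.DataFrame(
    {
        "endTime": pd.to_datetime(["2020-12-31 11:00", "2020-12-31 11:05"]),
        "uin": [1, 2],
        "stationList": [["A", "B"], ["C"]],
    }
)

print(columns_holding_containers(example))
```

Any column this reports would have to be reshaped before the frame fits the "table" format.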

    In my mind, you have two options:

    1. Pivot your data so you have one row of data per station name. Then you can store everything in a table-format HDF5 file. (This is a good practice in general; see Hadley Wickham on Tidy Data.)
    2. If you really want to keep this format, then you might as well save the whole dataframe using to_pickle(). This will have no problem dealing with any kind of object (e.g. list of strings, etc.) you throw at it.
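The two options could be sketched roughly as follows. The helper names, the `station` column, the `mydf` key and the sample values are all assumptions made for illustration; `DataFrame.explode` is used here as one way to get one row per station, which may or may not match the exact reshaping you need. Writing HDF5 from pandas also requires the PyTables package to be installed.

```python
import pandas as pd


def one_row_per_station(df, list_col="stationList", new_col="station"):
    """Option 1 reshaping: repeat each row once per entry of its station list."""
    tidy = df.explode(list_col).rename(columns={list_col: new_col})
    # After exploding, the column holds plain strings, so it can become a category.
    tidy[new_col] = tidy[new_col].astype("category")
    return tidy.reset_index(drop=True)


def save_option_1(df, path, key="mydf"):
    """Option 1: write the reshaped frame in table format (needs PyTables)."""
    one_row_per_station(df).to_hdf(path, key=key, format="table")


def save_option_2(df, path):
    """Option 2: keep the original layout and pickle the whole frame."""
    df.to_pickle(path)


# Hypothetical frame loosely modelled on the question's columns.
example = pd.DataFrame(
    {
        "endTime": pd.to_datetime(["2020-12-31 11:00", "2020-12-31 11:05"]),
        "uin": [1, 2],
        "stationList": [["A", "B"], ["C"]],
    }
)

tidy = one_row_per_station(example)
print(tidy)
```

Note that a row with an empty station list comes out of `explode` as a single row with a missing station, so such rows may need separate handling.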

    Personally, I would recommend option 1. You get to use a fast, binary file format. And the pivot will also make other operations with your data easier.
