Python (pandas): store a data frame in hdf5 with a multi index

大城市里の小女人 提交于 2019-12-03 15:35:56

Here's an example

In [8]: pd.__version__
Out[8]: '0.14.1'

In [9]: np.__version__
Out[9]: '1.8.1'

In [10]: import sys

In [11]: sys.version
Out[11]: '2.7.3 (default, Jan  7 2013, 09:17:50) \n[GCC 4.4.5]'

In [4]: df = DataFrame(np.arange(9).reshape(9,-1),index=pd.MultiIndex.from_product([list('abc'),date_range('20140721',periods=3)],names=['symbol','date']),columns=['value'])

In [5]: df
Out[5]: 
                   value
symbol date             
a      2014-07-21      0
       2014-07-22      1
       2014-07-23      2
b      2014-07-21      3
       2014-07-22      4
       2014-07-23      5
c      2014-07-21      6
       2014-07-22      7
       2014-07-23      8

In [6]: df.to_hdf('test.h5','df',mode='w',format='table')

In [7]: pd.read_hdf('test.h5','df',where='date=20140722')
Out[7]: 
                   value
symbol date             
a      2014-07-22      1
b      2014-07-22      4
c      2014-07-22      7

In [12]: pd.read_hdf('test.h5','df',where='symbol="a"')
Out[12]: 
                   value
symbol date             
a      2014-07-21      0
       2014-07-22      1
       2014-07-23      2

Jeff has completely the right answer. I found a couple gotchas that I wanted to share and it won't fit in a comment - please consider this just a long form additional comment :)

(Pytables Versions) If you get missing attribute or method errors when trying to write the hdf file you may want to try updating your PyTables version. Pandas (as of this writing) leverages Pytables and I found at least one pairing of versions that threw some odd errors until I updated Pytables and reloaded.

(Data types) This may be fixed in Python 3 but in 2.7x the to_hdf has problems with unicode, with mixed data type columns, and with NaN values in floating point. Below is an example utility function to clean up a DataFrame in preparation for writing to_hdf that fixed all those problems for me. Note that this replaces NaN with zero, which was appropriate for my application but you may want to adjust that:

def clean_cols_for_hdf(data):
    types = data.apply(lambda x: pd.lib.infer_dtype(x.values))
    for col in types[types=='mixed'].index:
        data[col] = .data[col].astype(str)
    data[<your appropriate columns here>].fillna(0,inplace=True)
    return data

Some of this just extends one of Jeff's comments as well. Jeff is awesome, please excuse the added answer but I wanted to chip in some details that fixed things for me.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!