Indexing and Data Columns in Pandas/PyTables

Happy的楠姐 2021-02-06 01:02

http://pandas.pydata.org/pandas-docs/stable/io.html#indexing

I'm really confused about this concept of Data columns in Pandas HDF5 IO. Plus there's very little to no i

1 answer
  •  自闭症患者
    2021-02-06 01:47

    You should just try it.

    In [20]: import numpy as np

    In [21]: import pandas as pd

    In [22]: df = pd.DataFrame(np.random.randn(5,2),columns=['A','B'])
    
    In [23]: store = pd.HDFStore('test.h5',mode='w')
    
    In [24]: store.append('df_only_indexables',df)
    
    In [25]: store.append('df_with_data_columns',df,data_columns=True)
    
    In [26]: store.append('df_no_index',df,data_columns=True,index=False)
    
    In [27]: store
    Out[27]: 
    
    File path: test.h5
    /df_no_index                     frame_table  (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A,B])
    /df_only_indexables              frame_table  (typ->appendable,nrows->5,ncols->2,indexers->[index])          
    /df_with_data_columns            frame_table  (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A,B])
    
    In [28]: store.close()
    
    • You automatically get the index of the stored frame as a queryable column. By default, NO other columns can be queried.

    • If you specify data_columns=True or data_columns=list_of_columns, those columns are stored as separate table columns and can subsequently be queried.

    • If you specify index=False, a PyTables index is not automatically created for the queryable columns (i.e. the index and/or the data_columns). They can still be queried, just without an index to speed up the search.
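
As a rough sketch of what "queryable" means in practice, the snippet below stores the same frame with and without data columns and then filters on disk with `where`. The file name `query_demo.h5` and the temporary directory are made up for illustration, and the exact exception type raised for a non-queryable column may differ between pandas versions, so both likely candidates are caught.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 2), columns=["A", "B"])

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "query_demo.h5")

with pd.HDFStore(path, mode="w") as store:
    store.append("df_only_indexables", df)
    store.append("df_with_data_columns", df, data_columns=True)

    # The index is always queryable, even without data columns.
    by_index = store.select("df_only_indexables", where="index >= 2")

    # A and B are queryable only where they were stored as data columns.
    by_value = store.select("df_with_data_columns", where="A > 0")

    # Filtering on A in the table without data columns is rejected.
    try:
        store.select("df_only_indexables", where="A > 0")
        query_failed = False
    except (ValueError, NameError):
        query_failed = True

print(len(by_index), len(by_value), query_failed)
```

The filtering in `select` happens inside PyTables, so only the matching rows are read into memory; that is the main reason to pay the extra storage cost of a data column.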

    The output below shows which PyTables indexes were actually created: colindexes lists the columns that have one, and /df_no_index has no colindexes entry at all. (I have truncated the output somewhat.)

    /df_no_index/table (Table(5,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "A": Float64Col(shape=(), dflt=0.0, pos=1),
      "B": Float64Col(shape=(), dflt=0.0, pos=2)}
      byteorder := 'little'
      chunkshape := (2730,)
      /df_no_index/table._v_attrs (AttributeSet), 15 attributes:
       [A_dtype := 'float64',
        A_kind := ['A'],
        B_dtype := 'float64',
        B_kind := ['B'],
        CLASS := 'TABLE',
        FIELD_0_FILL := 0,
        FIELD_0_NAME := 'index',
        FIELD_1_FILL := 0.0,
        FIELD_1_NAME := 'A',
        FIELD_2_FILL := 0.0,
        FIELD_2_NAME := 'B',
        NROWS := 5,
        TITLE := '',
        VERSION := '2.7',
        index_kind := 'integer']
    /df_only_indexables/table (Table(5,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1)}
      byteorder := 'little'
      chunkshape := (2730,)
      autoindex := True
      colindexes := {
        "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
      /df_only_indexables/table._v_attrs (AttributeSet), 11 attributes:
       [CLASS := 'TABLE',
        FIELD_0_FILL := 0,
        FIELD_0_NAME := 'index',
        FIELD_1_FILL := 0.0,
        FIELD_1_NAME := 'values_block_0',
        NROWS := 5,
        TITLE := '',
        VERSION := '2.7',
        index_kind := 'integer',
        values_block_0_dtype := 'float64',
        values_block_0_kind := ['A', 'B']]
    /df_with_data_columns/table (Table(5,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "A": Float64Col(shape=(), dflt=0.0, pos=1),
      "B": Float64Col(shape=(), dflt=0.0, pos=2)}
      byteorder := 'little'
      chunkshape := (2730,)
      autoindex := True
      colindexes := {
        "A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
        "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
        "B": Index(6, medium, shuffle, zlib(1)).is_csi=False}
      /df_with_data_columns/table._v_attrs (AttributeSet), 15 attributes:
       [A_dtype := 'float64',
        A_kind := ['A'],
        B_dtype := 'float64',
        B_kind := ['B'],
        CLASS := 'TABLE',
        FIELD_0_FILL := 0,
        FIELD_0_NAME := 'index',
        FIELD_1_FILL := 0.0,
        FIELD_1_NAME := 'A',
        FIELD_2_FILL := 0.0,
        FIELD_2_NAME := 'B',
        NROWS := 5,
        TITLE := '',
        VERSION := '2.7',
        index_kind := 'integer']
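
The same information can probably be pulled out programmatically instead of reading the dump. This is a minimal sketch, assuming `store.get_storer(key).table` hands back the underlying PyTables `Table`, whose `colnames` and `colindexes` attributes mirror the `description` and `colindexes` sections above. The file name `inspect_demo.h5` is hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 2), columns=["A", "B"])

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "inspect_demo.h5")

keys = ["df_only_indexables", "df_with_data_columns", "df_no_index"]
colnames = {}
indexed = {}

with pd.HDFStore(path, mode="w") as store:
    store.append("df_only_indexables", df)
    store.append("df_with_data_columns", df, data_columns=True)
    store.append("df_no_index", df, data_columns=True, index=False)

    for key in keys:
        table = store.get_storer(key).table  # the PyTables Table node
        colnames[key] = list(table.colnames)  # physical columns on disk
        indexed[key] = sorted(table.colindexes.keys())  # columns with a PyTables index

for key in keys:
    print(key, colnames[key], indexed[key])
```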
    

    So if you want to query a column, make it a data_column. If you don't, the columns are stored together in blocks by dtype (faster / less space).
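
To illustrate the trade-off, here is a small sketch that makes only `A` a data column, so `B` stays in a dtype block. The key `df_mixed` and the file name `mixed_demo.h5` are invented for the example; the block name `values_block_0` is what the output above shows for a single float64 block.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 2), columns=["A", "B"])

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "mixed_demo.h5")

with pd.HDFStore(path, mode="w") as store:
    # Only A becomes a separate, queryable column; B stays in a dtype block.
    store.append("df_mixed", df, data_columns=["A"])

    physical_columns = set(store.get_storer("df_mixed").table.colnames)

    # Filter on the data column, but still get every column back.
    result = store.select("df_mixed", where="A > 0")

print(physical_columns)
print(result.columns.tolist())
```

Picking only the columns you actually filter on keeps most of the frame in compact blocks while still allowing on-disk queries.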

    You normally want to index a column for retrieval, BUT if you are creating a store and then appending multiple files to it, you usually turn off index creation and build the index at the end (updating it on every append is pretty expensive).
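
The append-then-index pattern might look roughly like this. The three in-memory chunks stand in for the multiple input files, the key `df` and file name `bulk_demo.h5` are hypothetical, and the `optlevel=9, kind="full"` arguments are just one possible choice of index settings.

```python
import os
import tempfile

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Three chunks standing in for separate input files (illustrative data).
chunks = [
    pd.DataFrame(
        rng.standard_normal((4, 2)),
        columns=["A", "B"],
        index=range(i * 4, (i + 1) * 4),
    )
    for i in range(3)
]

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "bulk_demo.h5")

with pd.HDFStore(path, mode="w") as store:
    # Append everything first, with index creation switched off.
    for chunk in chunks:
        store.append("df", chunk, data_columns=["A"], index=False)

    indexed_before = sorted(store.get_storer("df").table.colindexes.keys())

    # Build the PyTables indexes once, after all appends are done.
    store.create_table_index("df", optlevel=9, kind="full")

    table = store.get_storer("df").table
    indexed_after = sorted(table.colindexes.keys())
    nrows = table.nrows

print(indexed_before, indexed_after, nrows)
```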

    See the cookbook for a menagerie of questions.
