When reading a huge HDF5 file with pandas.read_hdf(), why do I still get a MemoryError even though I read in chunks by specifying chunksize?


Problem description:

I use Python pandas to read a few large CSV files and store them in an HDF5 file; the resulting HDF5 file is about 10 GB. The problem happens when I read it back: even though I pass a chunksize to pandas.read_hdf(), I still get a MemoryError.
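Roughly what I am doing looks like the sketch below; the file names, key, and chunk size are placeholders rather than the exact values from my script:

    import pandas as pd

    # Convert several large CSV files into one HDF5 file, appending in chunks.
    with pd.HDFStore('data.h5', mode='w') as store:
        for path in ['part1.csv', 'part2.csv', 'part3.csv']:
            for chunk in pd.read_csv(path, chunksize=1_000_000):
                store.append('df', chunk, data_columns=True)

    # Read the ~10 GB HDF5 file back in chunks -- this still raises MemoryError.
    for chunk in pd.read_hdf('data.h5', 'df', chunksize=1_000_000):
        pass  # process the chunk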

2 Answers
  • 2020-12-14 20:52

    So the iterator is built mainly to deal with a where clause. PyTables returns a list of the indices where the clause is True; these are row numbers. In this case there is no where clause, but we still use the indexer, which here is simply np.arange over the full list of rows.

    An index of 300 million row numbers takes about 2.2 GB, which is too much for 32-bit Windows (which generally maxes out around 1 GB for an allocation like this). On 64-bit this would be no problem.

    In [1]: import numpy as np

    In [2]: np.arange(0, 300000000).nbytes / (1024 * 1024 * 1024.0)
    Out[2]: 2.2351741790771484
    

    So this should be handled by slicing semantics, which would make it take only a trivial amount of memory. An issue has been opened for this.

    So I would suggest the approach below: the indexer is computed directly via start/stop, which gives you iterator semantics without building the full row-number index.

    In [1]: import numpy as np

    In [2]: import pandas as pd

    In [3]: df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))

    In [4]: df.to_hdf('test.h5', key='df', mode='w', format='table', data_columns=True)

    In [5]: store = pd.HDFStore('test.h5')

    In [6]: nrows = store.get_storer('df').nrows

    In [7]: chunksize = 100

    In [8]: for i in range(nrows // chunksize + 1):
       ...:     chunk = store.select('df',
       ...:                          start=i * chunksize,
       ...:                          stop=(i + 1) * chunksize)
       ...:     # work on the chunk

    In [9]: store.close()
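    If you want the same call pattern as read_hdf with chunksize, the loop above can be wrapped in a small generator. This is only a sketch under the same table-format assumption; the helper name iter_hdf_chunks and its default chunk size are mine, not something provided by pandas.

    import pandas as pd

    def iter_hdf_chunks(path, key, chunksize=100_000):
        # Yield DataFrame chunks from a table-format HDF5 store by slicing
        # with start/stop, so no full row-number index is ever materialized.
        with pd.HDFStore(path, mode='r') as store:
            nrows = store.get_storer(key).nrows
            for start in range(0, nrows, chunksize):
                yield store.select(key, start=start, stop=start + chunksize)

    for chunk in iter_hdf_chunks('test.h5', 'df', chunksize=100):
        pass  # work on the chunk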
    
  • 2020-12-14 21:15

    If you saved your data with the default fixed format, you need to use store.get_storer('df').shape[0] to get nrows.
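
    A minimal sketch of that, assuming a DataFrame saved with to_hdf in the default fixed format (the file and key names are placeholders):

    import numpy as np
    import pandas as pd

    # Save in the default fixed format (no format='table').
    df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))
    df.to_hdf('test_fixed.h5', key='df', mode='w')

    with pd.HDFStore('test_fixed.h5', mode='r') as store:
        # Fixed-format storers expose shape rather than nrows.
        nrows = store.get_storer('df').shape[0]
    print(nrows)  # 1000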
