I use Python pandas to read a few large CSV files and store them in an HDF5 file; the resulting HDF5 file is about 10GB. The problem happens when I read it back.
So the iterator is built mainly to deal with a where clause. PyTables returns a list of the indices where the clause is True; these are row numbers. In this case there is no where clause, but we still use the indexer, which here is simply np.arange on the list of rows.
300MM rows take about 2.2GB, which is too much for 32-bit Windows (which generally maxes out around 1GB). On 64-bit this would be no problem.
In [1]: np.arange(0,300000000).nbytes/(1024*1024*1024.0)
Out[1]: 2.2351741790771484
So this should be handled by slicing semantics, which would make this take only a trivial amount of memory. Issue opened here.
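By contrast, a lazy range (or the start/stop slicing used below) stores only its endpoints, so describing the same 300MM rows costs a few dozen bytes instead of gigabytes. A quick check, assuming a 64-bit CPython 3 (the exact byte count varies between builds):
In [1]: import sys
In [2]: sys.getsizeof(range(0, 300000000))   # only start/stop/step are stored, not 300MM integers
Out[2]: 48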
So I would suggest the following approach, where the indexer is computed directly via start/stop; this provides iterator semantics.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))
In [4]: df.to_hdf('test.h5', 'df', mode='w', format='table', data_columns=True)
In [5]: store = pd.HDFStore('test.h5')
In [6]: nrows = store.get_storer('df').nrows
In [7]: chunksize = 100
In [8]: for i in range(nrows // chunksize + 1):
   ...:     chunk = store.select('df',
   ...:                          start=i * chunksize,
   ...:                          stop=(i + 1) * chunksize)
   ...:     # work on the chunk
In [9]: store.close()
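If you need this pattern in more than one place, the loop can be wrapped in a small generator. This is just a sketch; the name iter_hdf_chunks is mine, not part of the pandas API, and it assumes the table-format file written above:

import pandas as pd

def iter_hdf_chunks(path, key, chunksize=100000):
    """Yield successive DataFrame chunks of `key` from a table-format HDF5 file."""
    with pd.HDFStore(path, mode='r') as store:
        nrows = store.get_storer(key).nrows
        # stepping by chunksize also avoids an empty trailing chunk when
        # nrows is an exact multiple of chunksize
        for start in range(0, nrows, chunksize):
            yield store.select(key, start=start, stop=start + chunksize)

# usage: process the file from the session above without loading it all at once
for chunk in iter_hdf_chunks('test.h5', 'df', chunksize=100):
    print(chunk.shape)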
If you use the default fixed format to save your data, you need to use store.get_storer('df').shape[0] to get nrows.
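A minimal sketch of that case (the file name test_fixed.h5 is just illustrative; to_hdf without format='table' writes the fixed format):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))
df.to_hdf('test_fixed.h5', 'df', mode='w')        # no format= -> default "fixed" format

with pd.HDFStore('test_fixed.h5') as store:
    nrows = store.get_storer('df').shape[0]       # shape[0] gives the row count here
print(nrows)   # 1000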