I use Python pandas to read a few large CSV files and store them in an HDF5 file; the resulting HDF5 file is about 10GB. The problem happens when I read it back.
So the iterator is built mainly to deal with a where clause. PyTables returns a list of the indices where the clause is True; these are row numbers. In this case there is no where clause, but we still use the indexer, which here is simply np.arange on the list of rows.
300MM rows take about 2.2GB, which is too much for 32-bit Windows (which generally maxes out around 1GB). On 64-bit this would be no problem.
In [1]: np.arange(0,300000000).nbytes/(1024*1024*1024.0)
Out[1]: 2.2351741790771484
So this should be handled by slicing semantics, which would make this take only a trivial amount of memory. Issue opened here.
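By contrast, a lazy range (or the start/stop slicing used below) stores only its endpoints, so describing the same 300MM rows costs a few dozen bytes instead of gigabytes. A quick check, assuming a 64-bit CPython 3 (the exact byte count varies between builds):
In [1]: import sys
In [2]: sys.getsizeof(range(0, 300000000))   # only start/stop/step are stored, not 300MM integers
Out[2]: 48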
So I would suggest the following approach, where the indexer is computed directly via start/stop; this provides iterator semantics.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))
In [4]: df.to_hdf('test.h5', 'df', mode='w', format='table', data_columns=True)
In [5]: store = pd.HDFStore('test.h5')
In [6]: nrows = store.get_storer('df').nrows
In [7]: chunksize = 100
In [8]: for i in range(nrows // chunksize + 1):
   ...:     chunk = store.select('df',
   ...:                          start=i * chunksize,
   ...:                          stop=(i + 1) * chunksize)
   ...:     # work on the chunk
In [9]: store.close()
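If you need this pattern in more than one place, the loop can be wrapped in a small generator. This is just a sketch; the name iter_hdf_chunks is mine, not part of the pandas API, and it assumes the table-format file written above:

import pandas as pd

def iter_hdf_chunks(path, key, chunksize=100000):
    """Yield successive DataFrame chunks of `key` from a table-format HDF5 file."""
    with pd.HDFStore(path, mode='r') as store:
        nrows = store.get_storer(key).nrows
        # stepping by chunksize also avoids an empty trailing chunk when
        # nrows is an exact multiple of chunksize
        for start in range(0, nrows, chunksize):
            yield store.select(key, start=start, stop=start + chunksize)

# usage: process the file from the session above without loading it all at once
for chunk in iter_hdf_chunks('test.h5', 'df', chunksize=100):
    print(chunk.shape)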
If you use the default fixed format to save your data, you need to use store.get_storer('df').shape[0] to get nrows.
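A minimal sketch of that case (the file name test_fixed.h5 is just illustrative; to_hdf without format='table' writes the fixed format):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))
df.to_hdf('test_fixed.h5', 'df', mode='w')        # no format= -> default "fixed" format

with pd.HDFStore('test_fixed.h5') as store:
    nrows = store.get_storer('df').shape[0]       # shape[0] gives the row count here
print(nrows)   # 1000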