I have a script that generates two-dimensional numpy arrays with dtype=float and shape on the order of (1e3, 1e6). Right
Here is a comparison with PyTables.
I cannot get up to (int(1e3), int(1e6)) due to memory restrictions. Therefore, I used a smaller array:
data = np.random.random((int(1e3), int(1e5)))
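As a rough check on the memory restriction, the footprint of both shapes can be estimated without allocating anything. This is only a sketch: it assumes dtype=float means 8-byte float64, and `footprint_gib` is a hypothetical helper name, not part of NumPy.

```python
# Assumption: dtype=float is float64, i.e. 8 bytes per element.
BYTES_PER_FLOAT64 = 8


def footprint_gib(shape, itemsize=BYTES_PER_FLOAT64):
    """Return the in-memory size of a dense 2-D array in GiB (hypothetical helper)."""
    rows, cols = shape
    return rows * cols * itemsize / 2**30


full_gib = footprint_gib((int(1e3), int(1e6)))   # the shape from the question
small_gib = footprint_gib((int(1e3), int(1e5)))  # the shape used for the timings

print(f"full array:  {full_gib:.2f} GiB")
print(f"small array: {small_gib:.2f} GiB")
```

The full array needs about 7.45 GiB for the data alone, while the smaller one needs about a tenth of that.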
NumPy save:
%timeit np.save('array.npy', data)
1 loops, best of 3: 4.26 s per loop
NumPy load:
%timeit data2 = np.load('array.npy')
1 loops, best of 3: 3.43 s per loop
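The %timeit lines above only work inside IPython. Outside of it, a similar best-of-3 measurement could look roughly like the sketch below. The helper `time_call` and the much smaller array are assumptions made so the example runs quickly; the absolute numbers will not match the ones above.

```python
import os
import tempfile
import time

import numpy as np


def time_call(func, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls (hypothetical helper)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


# Deliberately small so the sketch finishes fast; scale the shape up to taste.
small = np.random.random((100, 1000))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "array.npy")
    save_time = time_call(lambda: np.save(path, small))
    load_time = time_call(lambda: np.load(path))
    loaded = np.load(path)  # read back once more to check the round trip

print(f"save: {save_time:.4f} s, load: {load_time:.4f} s")
```

Taking the minimum over several runs mirrors what %timeit reports as "best of 3".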
PyTables writing:
%%timeit
with tables.open_file('array.tbl', 'w') as h5_file:
    h5_file.create_array('/', 'data', data)
1 loops, best of 3: 4.16 s per loop
PyTables reading:
%%timeit
with tables.open_file('array.tbl', 'r') as h5_file:
    data2 = h5_file.root.data.read()
1 loops, best of 3: 3.51 s per loop
The numbers are very similar, so there is no real gain with PyTables here. But we are pretty close to the maximum writing and reading rate of my SSD.
Writing:
Maximum write speed: 241.6 MB/s
PyTables write speed: 183.4 MB/s
Reading:
Maximum read speed: 250.2 MB/s
PyTables read speed: 217.4 MB/s
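The PyTables speeds above appear to be the array size divided by the measured times. A small sketch of that arithmetic follows, assuming 8 bytes per float and a size expressed in units of 2**20 bytes, which is what reproduces the figures. `throughput_mib_per_s` is a hypothetical helper name.

```python
def throughput_mib_per_s(shape, seconds, itemsize=8):
    """Array size in MiB (2**20 bytes) divided by elapsed seconds (hypothetical helper)."""
    rows, cols = shape
    size_mib = rows * cols * itemsize / 2**20
    return size_mib / seconds


shape = (int(1e3), int(1e5))  # the array used for the timings

write_speed = throughput_mib_per_s(shape, 4.16)  # PyTables writing: 4.16 s
read_speed = throughput_mib_per_s(shape, 3.51)   # PyTables reading: 3.51 s

print(f"write: {write_speed:.1f}, read: {read_speed:.1f}")
```

This prints 183.4 for writing and 217.4 for reading, matching the values listed above.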
Compression does not really help due to the randomness of the data:
%%timeit
FILTERS = tables.Filters(complib='blosc', complevel=5)
with tables.open_file('array.tbl', mode='w', filters=FILTERS) as h5_file:
    h5_file.create_carray('/', 'data', obj=data)
1 loops, best of 3: 4.08 s per loop
Reading of the compressed data becomes a bit slower:
%%timeit
with tables.open_file('array.tbl', 'r') as h5_file:
    data2 = h5_file.root.data.read()
1 loops, best of 3: 4.01 s per loop
This is different for regular data:
reg_data = np.ones((int(1e3), int(1e5)))
Writing is significantly faster:
%%timeit
FILTERS = tables.Filters(complib='blosc', complevel=5)
with tables.open_file('array.tbl', mode='w', filters=FILTERS) as h5_file:
    h5_file.create_carray('/', 'reg_data', obj=reg_data)
1 loops, best of 3: 849 ms per loop
The same holds true for reading:
%%timeit
with tables.open_file('array.tbl', 'r') as h5_file:
    reg_data2 = h5_file.root.reg_data.read()
1 loops, best of 3: 1.7 s per loop
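The difference between random and regular data can be illustrated by looking at how well the raw bytes compress. The sketch below uses zlib from the standard library as a stand-in, not blosc as in the timings above, and much smaller arrays, so it only shows the trend and not the actual PyTables behaviour. `compression_ratio` is a hypothetical helper.

```python
import zlib

import numpy as np


def compression_ratio(arr, level=5):
    """Uncompressed size divided by zlib-compressed size (hypothetical helper)."""
    raw = arr.tobytes()
    return len(raw) / len(zlib.compress(raw, level))


rng = np.random.default_rng(0)
random_data = rng.random((100, 1000))   # random floats: close to incompressible
regular_data = np.ones((100, 1000))     # constant floats: highly repetitive bytes

random_ratio = compression_ratio(random_data)
regular_ratio = compression_ratio(regular_data)

print(f"random data ratio:  {random_ratio:.2f}")
print(f"regular data ratio: {regular_ratio:.0f}")
```

The ratio for the random array stays close to 1, while the constant array shrinks by orders of magnitude, so far fewer bytes have to go to and from the disk.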
Conclusion: The more regular your data, the faster it should get using PyTables with compression.