Fastest save and load options for a numpy array

春和景丽 · 2020-11-29 02:36

I have a script that generates two-dimensional numpy arrays with dtype=float and shape on the order of (1e3, 1e6). Right now I am using np.save and np.load to perform IO operations with the arrays. However, these functions take several seconds for each array. Are there faster methods for saving and loading the entire arrays (i.e., without making assumptions about their contents and reducing them)? I am open to converting the arrays to another format before saving, as long as the data are retained exactly.

4 Answers
  •  执念已碎
    2020-11-29 03:17

    Here is a comparison with PyTables.

    I cannot get up to (int(1e3), int(1e6)) due to memory restrictions (an array of that shape holds 1e9 float64 values, i.e. 8 GB). Therefore, I used a smaller array:

    import numpy as np
    import tables

    data = np.random.random((int(1e3), int(1e5)))
    

    NumPy save:

    %timeit np.save('array.npy', data)
    1 loops, best of 3: 4.26 s per loop
    

    NumPy load:

    %timeit data2 = np.load('array.npy')
    1 loops, best of 3: 3.43 s per loop
    

    PyTables writing:

    %%timeit
    with tables.open_file('array.tbl', 'w') as h5_file:
        h5_file.create_array('/', 'data', data)
    
    1 loops, best of 3: 4.16 s per loop
    

    PyTables reading:

    %%timeit
    with tables.open_file('array.tbl', 'r') as h5_file:
        data2 = h5_file.root.data.read()

    1 loops, best of 3: 3.51 s per loop
    

    The numbers are very similar, so there is no real gain with PyTables here. But we are pretty close to the maximum write and read rate of my SSD.

    Writing:

    Maximum write speed: 241.6 MB/s
    PyTables write speed: 183.4 MB/s
    

    Reading:

    Maximum read speed: 250.2 MB/s
    PyTables read speed: 217.4 MB/s
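
    These figures follow directly from the array size and the timings above; here is a quick sketch of the arithmetic (dividing MiB by the measured seconds reproduces the PyTables numbers):

    nbytes = data.nbytes          # 1e3 * 1e5 * 8 bytes = 800,000,000
    print(nbytes / 2**20 / 4.16)  # write: ~183 MiB/s
    print(nbytes / 2**20 / 3.51)  # read:  ~217 MiB/s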
    

    Compression does not really help due to the randomness of the data:

    %%timeit
    FILTERS = tables.Filters(complib='blosc', complevel=5)
    with tables.open_file('array.tbl', mode='w', filters=FILTERS) as h5_file:
        h5_file.create_carray('/', 'data', obj=data)

    1 loops, best of 3: 4.08 s per loop
    

    Reading the compressed data is a bit slower:

    %%timeit
    with tables.open_file('array.tbl', 'r') as h5_file:
        data2 = h5_file.root.data.read()
    
    1 loops, best of 3: 4.01 s per loop
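
    Since the point is to get the data back exactly, it is worth a quick sanity check that the round trip is lossless (a one-liner, assuming data and data2 from the snippets above; HDF5 compression filters such as blosc are lossless):

    np.array_equal(data, data2)  # True: the compressed round trip is bit-exact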
    

    This is different for regular data:

    reg_data = np.ones((int(1e3), int(1e5)))
    

    Writing is significantly faster:

    %%timeit
    FILTERS = tables.Filters(complib='blosc', complevel=5)
    with tables.open_file('array.tbl', mode='w', filters=FILTERS) as h5_file:
        h5_file.create_carray('/', 'reg_data', obj=reg_data)
    

    1 loops, best of 3: 849 ms per loop

    The same holds true for reading:

    %%timeit
    with tables.open_file('array.tbl', 'r') as h5_file:
        reg_data2 = h5_file.root.reg_data.read()
    
    1 loops, best of 3: 1.7 s per loop
    

    Conclusion: The more regular your data is, the faster it gets with PyTables, because that is when compression becomes effective.
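
    For anyone who wants to rerun this outside IPython, here is a minimal self-contained sketch of the same comparison using time.perf_counter in place of %timeit (the file names and the smaller array shape are my choices to keep the run short):

    import time
    import numpy as np
    import tables

    data = np.random.random((int(1e3), int(1e4)))  # smaller than above, for a quick run

    def bench(label, fn):
        # time a single call and print the wall-clock duration
        t0 = time.perf_counter()
        fn()
        print(f'{label}: {time.perf_counter() - t0:.2f} s')

    bench('np.save', lambda: np.save('array.npy', data))
    bench('np.load', lambda: np.load('array.npy'))

    def pytables_write():
        with tables.open_file('array.tbl', 'w') as h5_file:
            h5_file.create_array('/', 'data', data)

    def pytables_read():
        with tables.open_file('array.tbl', 'r') as h5_file:
            h5_file.root.data.read()

    bench('PyTables write', pytables_write)
    bench('PyTables read', pytables_read)

    # compressed variant: only pays off when the data is compressible
    FILTERS = tables.Filters(complib='blosc', complevel=5)

    def pytables_write_blosc():
        with tables.open_file('array.tbl', mode='w', filters=FILTERS) as h5_file:
            h5_file.create_carray('/', 'data', obj=data)

    bench('PyTables write (blosc)', pytables_write_blosc)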
