Fastest save and load options for a numpy array

春和景丽 · 2020-11-29 02:36

I have a script that generates two-dimensional numpy arrays with dtype=float and shape on the order of (1e3, 1e6). Right now I am using np.save and np.load to perform IO operations with the arrays. However, these functions take several seconds for each array. Are there faster methods for saving and loading the entire arrays (i.e., without making assumptions about their contents and reducing them)? I am open to converting the arrays to another format before saving, as long as the data are retained exactly.

4 Answers
  •  执念已碎
    2020-11-29 03:17

    Here is a comparison with PyTables.

    I cannot get up to (int(1e3), int(1e6)) due to memory restrictions (an array of that shape holds 1e9 float64 values, i.e. 8 GB). Therefore, I used a smaller array:

    import numpy as np
    import tables

    data = np.random.random((int(1e3), int(1e5)))
    

    NumPy save:

    %timeit np.save('array.npy', data)
    1 loops, best of 3: 4.26 s per loop
    

    NumPy load:

    %timeit data2 = np.load('array.npy')
    1 loops, best of 3: 3.43 s per loop
    

    PyTables writing:

    %%timeit
    with tables.open_file('array.tbl', 'w') as h5_file:
        h5_file.create_array('/', 'data', data)
    
    1 loops, best of 3: 4.16 s per loop
    

    PyTables reading:

    %%timeit
    with tables.open_file('array.tbl', 'r') as h5_file:
        data2 = h5_file.root.data.read()

    1 loops, best of 3: 3.51 s per loop
    

    The numbers are very similar, so there is no real gain with PyTables here. But we are pretty close to the maximum write and read rate of my SSD.

    Writing:

    Maximum write speed: 241.6 MB/s
    PyTables write speed: 183.4 MB/s
    

    Reading:

    Maximum read speed: 250.2 MB/s
    PyTables read speed: 217.4 MB/s
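
    These figures follow directly from the array size and the timings above; here is a quick sketch of the arithmetic (dividing MiB by the measured seconds reproduces the PyTables numbers):

    nbytes = data.nbytes          # 1e3 * 1e5 * 8 bytes = 800,000,000
    print(nbytes / 2**20 / 4.16)  # write: ~183 MiB/s
    print(nbytes / 2**20 / 3.51)  # read:  ~217 MiB/s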
    

    Compression does not really help due to the randomness of the data:

    %%timeit
    FILTERS = tables.Filters(complib='blosc', complevel=5)
    with tables.open_file('array.tbl', mode='w', filters=FILTERS) as h5_file:
        h5_file.create_carray('/', 'data', obj=data)

    1 loops, best of 3: 4.08 s per loop
    

    Reading the compressed data is a bit slower:

    %%timeit
    with tables.open_file('array.tbl', 'r') as h5_file:
        data2 = h5_file.root.data.read()
    
    1 loops, best of 3: 4.01 s per loop
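
    Since the point is to get the data back exactly, it is worth a quick sanity check that the round trip is lossless (a one-liner, assuming data and data2 from the snippets above; HDF5 compression filters such as blosc are lossless):

    np.array_equal(data, data2)  # True: the compressed round trip is bit-exact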
    

    This is different for regular data:

    reg_data = np.ones((int(1e3), int(1e5)))
    

    Writing is significantly faster:

    %%timeit
    FILTERS = tables.Filters(complib='blosc', complevel=5)
    with tables.open_file('array.tbl', mode='w', filters=FILTERS) as h5_file:
        h5_file.create_carray('/', 'reg_data', obj=reg_data)
    

    1 loops, best of 3: 849 ms per loop

    The same holds true for reading:

    %%timeit
    with tables.open_file('array.tbl', 'r') as h5_file:
        reg_data2 = h5_file.root.reg_data.read()
    
    1 loops, best of 3: 1.7 s per loop
    

    Conclusion: The more regular your data is, the faster it gets with PyTables, because that is when compression becomes effective.
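
    For anyone who wants to rerun this outside IPython, here is a minimal self-contained sketch of the same comparison using time.perf_counter in place of %timeit (the file names and the smaller array shape are my choices to keep the run short):

    import time
    import numpy as np
    import tables

    data = np.random.random((int(1e3), int(1e4)))  # smaller than above, for a quick run

    def bench(label, fn):
        # time a single call and print the wall-clock duration
        t0 = time.perf_counter()
        fn()
        print(f'{label}: {time.perf_counter() - t0:.2f} s')

    bench('np.save', lambda: np.save('array.npy', data))
    bench('np.load', lambda: np.load('array.npy'))

    def pytables_write():
        with tables.open_file('array.tbl', 'w') as h5_file:
            h5_file.create_array('/', 'data', data)

    def pytables_read():
        with tables.open_file('array.tbl', 'r') as h5_file:
            h5_file.root.data.read()

    bench('PyTables write', pytables_write)
    bench('PyTables read', pytables_read)

    # compressed variant: only pays off when the data is compressible
    FILTERS = tables.Filters(complib='blosc', complevel=5)

    def pytables_write_blosc():
        with tables.open_file('array.tbl', mode='w', filters=FILTERS) as h5_file:
            h5_file.create_carray('/', 'data', obj=data)

    bench('PyTables write (blosc)', pytables_write_blosc)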
