Fastest save and load options for a numpy array

春和景丽 2020-11-29 02:36

I have a script that generates two-dimensional numpy arrays with dtype=float and shape on the order of (1e3, 1e6). Right

4 Answers
  • 2020-11-29 03:10

    In my experience, np.save() and np.load() are the fastest solution for transferring data between hard disk and memory that I have found so far. I relied heavily on databases and an HDFS system for data loading before I reached this conclusion. My tests show that the database loading bandwidth (from hard disk to memory) was around 50 MB/s, while the np.load() bandwidth was almost the same as my hard disk's maximum bandwidth, about 2 GB/s. Both test environments used the simplest data structure.

    And I don't think taking a few seconds to load an array of shape (1e3, 1e6) is a problem. For example, if your array has shape (1000, 1000000) and dtype float128, the raw data size is (128/8) * 1000 * 1,000,000 = 16,000,000,000 bytes = 16 GB, and if loading it takes 4 seconds, your loading bandwidth is 16 GB / 4 s = 4 GB/s. The maximum SATA3 bandwidth is 600 MB/s = 0.6 GB/s, so that loading rate is already more than six times faster, and it almost competes with DDR memory's maximum bandwidth. What else do you want?
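
    A minimal sketch of how such a bandwidth figure can be measured; the smaller array shape and the file name below are placeholders, not the original setup:

    import time
    import numpy as np

    data = np.random.random((1000, 100000))   # placeholder array, ~800 MB of float64
    path = "bandwidth_test.npy"                # hypothetical file name

    t0 = time.perf_counter()
    np.save(path, data)                        # write to disk
    write_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    loaded = np.load(path)                     # read back into memory
    read_s = time.perf_counter() - t0

    size_gb = data.nbytes / 1e9
    print(f"write: {size_gb / write_s:.2f} GB/s, read: {size_gb / read_s:.2f} GB/s")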

    So my final conclusion is:

    If you can use np.save() and np.load(), don't use Python's pickle, a database, or any big-data system to store your data on hard disk. These two functions are the fastest solution I have found so far for transferring data between hard disk and memory.

    I've also tested HDF5 and found it much slower than np.load() and np.save(), so use np.save() and np.load() if you have enough DDR memory on your platform.

  • 2020-11-29 03:17

    Here is a comparison with PyTables.

    I cannot get up to (int(1e3), int(1e6)) due to memory restrictions. Therefore, I used a smaller array:

    data = np.random.random((int(1e3), int(1e5)))
    

    NumPy save:

    %timeit np.save('array.npy', data)
    1 loops, best of 3: 4.26 s per loop
    

    NumPy load:

    %timeit data2 = np.load('array.npy')
    1 loops, best of 3: 3.43 s per loop
    

    PyTables writing:

    %%timeit
    with tables.open_file('array.tbl', 'w') as h5_file:
        h5_file.create_array('/', 'data', data)
    
    1 loops, best of 3: 4.16 s per loop
    

    PyTables reading:

    %%timeit
    with tables.open_file('array.tbl', 'r') as h5_file:
        data2 = h5_file.root.data.read()

    1 loops, best of 3: 3.51 s per loop
    

    The numbers are very similar, so there is no real gain with PyTables here. But we are pretty close to the maximum write and read rate of my SSD.

    Writing:

    Maximum write speed: 241.6 MB/s
    PyTables write speed: 183.4 MB/s
    

    Reading:

    Maximum read speed: 250.2 MB/s
    PyTables read speed: 217.4 MB/s
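
    For reference, a rough sketch of the arithmetic that presumably produced the MB/s figures above (assuming they were computed as data.nbytes divided by the measured time, with MB meaning MiB):

    import numpy as np

    data = np.random.random((int(1e3), int(1e5)))   # same array as above, ~763 MiB
    data_mib = data.nbytes / 2**20
    print(data_mib / 4.16)   # PyTables write: ~183 MB/s
    print(data_mib / 3.51)   # PyTables read:  ~217 MB/s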
    

    Compression does not really help due to the randomness of the data:

    %%timeit
    FILTERS = tables.Filters(complib='blosc', complevel=5)
    with tables.open_file('array.tbl', mode='w', filters=FILTERS) as h5_file:
        h5_file.create_carray('/', 'data', obj=data)
    1 loops, best of 3: 4.08 s per loop
    

    Reading of the compressed data becomes a bit slower:

    %%timeit
    with tables.open_file('array.tbl', 'r') as h5_file:
        data2 = h5_file.root.data.read()
    
    1 loops, best of 3: 4.01 s per loop
    

    This is different for regular data:

     reg_data = np.ones((int(1e3), int(1e5)))
    

    Writing is significantly faster:

    %%timeit
    FILTERS = tables.Filters(complib='blosc', complevel=5)
    with tables.open_file('array.tbl', mode='w', filters=FILTERS) as h5_file:
        h5_file.create_carray('/', 'reg_data', obj=reg_data)
    

    1 loops, best of 3: 849 ms per loop

    The same holds true for reading:

    %%timeit
    with tables.open_file('array.tbl', 'r') as h5_file:
        reg_data2 = h5_file.root.reg_data.read()
    
    1 loops, best of 3: 1.7 s per loop
    

    Conclusion: The more regular your data is, the faster it gets with PyTables.

  • 2020-11-29 03:23

    For really big arrays, I've heard about several solutions, and they mostly rely on being lazy with the I/O:

    • NumPy.memmap, which maps big arrays to binary form
      • Pros:
        • No dependency other than NumPy
        • Transparent replacement for ndarray (any class accepting an ndarray accepts a memmap)
      • Cons:
        • Chunks of your array are limited to 2.5G
        • Still limited by NumPy throughput
    • Use Python bindings for HDF5, a big-data-ready file format, like PyTables or h5py

      • Pros:
        • Format supports compression, indexing, and other super nice features
        • Apparently the ultimate petabyte-scale file format
      • Cons:
        • Learning curve of having a hierarchical format?
        • Have to define what your performance needs are (see later)
    • Python's pickling system (out of the race, mentioned for Pythonicity rather than speed; a minimal sketch follows this list)

      • Pros:
        • It's Pythonic! (haha)
        • Supports all sorts of objects
      • Cons:
        • Probably slower than the others (because it is aimed at arbitrary objects, not arrays)
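
    Since pickle is listed mostly for completeness, here is a minimal sketch of that route; the file name is a placeholder, and pickle.HIGHEST_PROTOCOL is chosen because newer protocols handle large binary payloads better:

    import pickle
    import numpy as np

    data = np.random.random((1000, 100000))   # placeholder array

    # dump the array to a file
    with open("data.pkl", "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

    # load it back
    with open("data.pkl", "rb") as f:
        data2 = pickle.load(f)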

    Numpy.memmap

    From the docs of NumPy.memmap:

    Create a memory-map to an array stored in a binary file on disk.

    Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory.

    The memmap object can be used anywhere an ndarray is accepted. Given any memmap fp, isinstance(fp, numpy.ndarray) returns True.
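
    A minimal sketch of dumping and lazily reloading an array via numpy.memmap; the file name, dtype, and shape below are placeholders, not taken from the question:

    import numpy as np

    shape = (1000, 100000)                 # placeholder shape
    # write: create a disk-backed array and fill it
    fp = np.memmap("data.mmap", dtype="float64", mode="w+", shape=shape)
    fp[:] = np.random.random(shape)
    fp.flush()                             # force the data to disk

    # read: map the same file back; only the slices you touch are read from disk
    fp2 = np.memmap("data.mmap", dtype="float64", mode="r", shape=shape)
    chunk = fp2[:10, :100]                 # loads just this segment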


    HDF5 arrays

    From the h5py docs:

    Lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.

    The format supports compressing the data in various ways (more bits loaded for the same I/O read), but this means the data becomes less easy to query individually; in your case (purely loading/dumping arrays) it might be efficient.
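
    A minimal sketch of dumping and loading an array with h5py; the file name and dataset name are placeholders, and the gzip filter is just one of the compression options the format offers:

    import h5py
    import numpy as np

    data = np.random.random((1000, 100000))   # placeholder array

    # write, optionally with compression
    with h5py.File("arrays.h5", "w") as f:
        f.create_dataset("data", data=data, compression="gzip")

    # read the whole dataset back into memory
    with h5py.File("arrays.h5", "r") as f:
        loaded = f["data"][()]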

  • 2020-11-29 03:30

    I've compared a few methods using perfplot (one of my projects). Here are the results:

    Writing

    For large arrays, all methods are about equally fast. The file sizes are also equal, which is to be expected since the input arrays are random doubles and hence hardly compressible.

    Code to reproduce the plot:

    import perfplot
    import pickle
    import numpy
    import h5py
    import tables
    import zarr
    
    
    def npy_write(data):
        numpy.save("npy.npy", data)
    
    
    def hdf5_write(data):
        f = h5py.File("hdf5.h5", "w")
        f.create_dataset("data", data=data)
        f.close()  # close the file so the data is flushed to disk
    
    
    def pickle_write(data):
        with open("test.pkl", "wb") as f:
            pickle.dump(data, f)
    
    
    def pytables_write(data):
        f = tables.open_file("pytables.h5", mode="w")
        gcolumns = f.create_group(f.root, "columns", "data")
        f.create_array(gcolumns, "data", data, "data")
        f.close()
    
    
    def zarr_write(data):
        zarr.save("out.zarr", data)
    
    
    perfplot.save(
        "write.png",
        setup=numpy.random.rand,
        kernels=[npy_write, hdf5_write, pickle_write, pytables_write, zarr_write],
        n_range=[2 ** k for k in range(28)],
        xlabel="len(data)",
        logx=True,
        logy=True,
        equality_check=None,
    )
    

    Reading

    pickles, pytables and hdf5 are roughly equally fast; pickles and zarr are slower for large arrays.

    Code to reproduce the plot:

    import perfplot
    import pickle
    import numpy
    import h5py
    import tables
    import zarr
    
    
    def setup(n):
        data = numpy.random.rand(n)
        # write all files
        #
        numpy.save("out.npy", data)
        #
        f = h5py.File("out.h5", "w")
        f.create_dataset("data", data=data)
        f.close()
        #
        with open("test.pkl", "wb") as f:
            pickle.dump(data, f)
        #
        f = tables.open_file("pytables.h5", mode="w")
        gcolumns = f.create_group(f.root, "columns", "data")
        f.create_array(gcolumns, "data", data, "data")
        f.close()
        #
        zarr.save("out.zip", data)
    
    
    def npy_read(data):
        return numpy.load("out.npy")
    
    
    def hdf5_read(data):
        f = h5py.File("out.h5", "r")
        out = f["data"][()]
        f.close()
        return out
    
    
    def pickle_read(data):
        with open("test.pkl", "rb") as f:
            out = pickle.load(f)
        return out
    
    
    def pytables_read(data):
        f = tables.open_file("pytables.h5", mode="r")
        out = f.root.columns.data[()]
        f.close()
        return out
    
    
    def zarr_read(data):
        return zarr.load("out.zip")
    
    
    perfplot.show(
        setup=setup,
        kernels=[
            npy_read,
            hdf5_read,
            pickle_read,
            pytables_read,
            zarr_read,
        ],
        n_range=[2 ** k for k in range(28)],
        xlabel="len(data)",
        logx=True,
        logy=True,
    )
    