Fastest save and load options for a numpy array

前端 未结 4 564
春和景丽
春和景丽 2020-11-29 02:36

I have a script that generates two-dimensional numpy arrays with dtype=float and shape on the order of (1e3, 1e6). Right

4条回答
  •  孤街浪徒
    2020-11-29 03:23

    For really big arrays, I've heard about several solutions, and they mostly on being lazy on the I/O :

    • NumPy.memmap, maps big arrays to binary form
      • Pros :
        • No dependency other than Numpy
        • Transparent replacement of ndarray (Any class accepting ndarray accepts memmap )
      • Cons :
        • Chunks of your array are limited to 2.5G
        • Still limited by Numpy throughput
    • Use Python bindings for HDF5, a bigdata-ready file format, like PyTables or h5py

      • Pros :
        • Format supports compression, indexing, and other super nice features
        • Apparently the ultimate PetaByte-large file format
      • Cons :
        • Learning curve of having a hierarchical format ?
        • Have to define what your performance needs are (see later)
    • Python's pickling system (out of the race, mentioned for Pythonicity rather than speed)

      • Pros:
        • It's Pythonic ! (haha)
        • Supports all sorts of objects
      • Cons:
        • Probably slower than others (because aimed at any objects not arrays)

    Numpy.memmap

    From the docs of NumPy.memmap :

    Create a memory-map to an array stored in a binary file on disk.

    Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory

    The memmap object can be used anywhere an ndarray is accepted. Given any memmap fp , isinstance(fp, numpy.ndarray) returns True.


    HDF5 arrays

    From the h5py doc

    Lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.

    The format supports compression of data in various ways (more bits loaded for same I/O read), but this means that the data becomes less easy to query individually, but in your case (purely loading / dumping arrays) it might be efficient

提交回复
热议问题