How to share memory from an HDF5 dataset with a NumPy ndarray

Submitted by ぃ、小莉子 on 2019-12-21 20:01:41

Question


I am writing an application for streaming data from a sensor, and then processing the data in various ways. These processing components include visualizing the data, some number crunching (linear algebra), and also writing the data to disk in an HDF5 format. Ideally each of these components will be its own module, all run in the same Python process so that IPC is not an issue. This leads me to the question of how to efficiently store the streaming data.

The datasets are quite large (~5 GB), so I would like to minimize the number of copies of the data in memory by sharing it between the components that need access. If all components used straight ndarrays, this would be straightforward: give one of the components the data, then give everyone else a view using ndarray.view().
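A minimal sketch of that pure-ndarray arrangement (the component names here are hypothetical, not part of the original design):

import numpy as np

buffer = np.zeros(1000000)           # the single owner of the memory

class Visualizer:
    def __init__(self, data):
        self.data = data.view()      # shares buffer's memory, no copy

class Cruncher:
    def __init__(self, data):
        self.data = data.view()

viz, crunch = Visualizer(buffer), Cruncher(buffer)
buffer[0] = 1.0
assert viz.data[0] == 1.0 and crunch.data[0] == 1.0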

However, the component writing data to disk stores the data in an HDF5 Dataset. These are interoperable with ndarrays in lots of ways, but it doesn't appear that creating a view() works the way it does with ndarrays.

Observe with ndarrays:

>>> source = np.zeros((10,))
>>> view = source.view()
>>> source[0] = 1
>>> view[0] == 1
True
>>> view.base is source
True

However, this doesn't work with HDF5 Datasets:

>>> import h5py
>>> file = h5py.File('source.h5', 'a')
>>> source_dset = file.create_dataset('source', (10,), dtype=np.float64)
>>> view_dset = source_dset.value.view()
>>> source_dset[0] = 1
>>> view_dset[0] == 1
False
>>> view_dset.base is source_dset.value
False

It also doesn't work to assign Dataset.value itself rather than a view of it:

>>> view_dset = source_dset.value
>>> source_dset[0] = 2
>>> view_dset[0] == 2
False
>>> view_dset.base is source_dset.value
False

So my question is this: Is there a way to have an ndarray share memory with an HDF5 Dataset, just as two ndarrays can share memory?

My guess is that this is unlikely to work, probably because of some subtlety in how HDF5 stores arrays in memory. But it is a bit confusing to me, especially since type(source_dset.value) == numpy.ndarray and the OWNDATA flag of Dataset.value.view() is actually False. Who owns the memory that the view is interpreting?
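One quick way to see who owns that memory (reusing source_dset from above): every access to .value reads the dataset into a brand-new ndarray, so the view's base is a throwaway in-memory copy rather than the dataset itself.

>>> a = source_dset.value
>>> b = source_dset.value
>>> a is b                      # each .value access builds a fresh array
False
>>> np.may_share_memory(a, b)   # and the two copies don't share memory
False
>>> v = source_dset.value.view()
>>> v.base.flags['OWNDATA']     # the temporary copy is what owns the memory
True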

Version details: Python 3, NumPy version 1.9.1, h5py version 2.3.1, HDF5 version 1.8.13, Linux.

Other details: HDF5 file is chunked.

EDIT:

After playing around with this a bit more, it seems like one possible solution is to give other components a reference to the HDF5 Dataset itself. This doesn't seem to copy any memory (at least not according to top), and changes in the source Dataset are reflected in the view.

>>> import h5py
>>> file = h5py.File('source.h5', 'a')
>>> source = file.create_dataset('source', (10,), dtype=np.float64)
>>> class Container():
...     def __init__(self, source_dset):
...         self.dset = source_dset
...
>>> container = Container(source)
>>> source[0] = 1
>>> container.dset[0] == 1
True
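As a quick extension of that sketch, reads and writes through the shared handle go straight to the underlying dataset, so each component only materializes the slice it actually touches:

>>> container.dset[2:5] = [3.0, 4.0, 5.0]   # writes go through to the file
>>> container.dset[2:5]                     # reads pull a fresh copy out
array([ 3.,  4.,  5.])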

I'm reasonably happy with this solution (as long as the memory savings pan out), but I'm still curious why the view approach above doesn't work.


Answer 1:


The short answer is that you can't share memory between a numpy array and an h5py dataset. While they have similar APIs (at least when it comes to indexing), they don't have compatible memory layouts. In fact, apart from some caching, the dataset isn't even in memory; it's in the file.


First, I don't see why you need to use source.view() with a numpy array. Yes, when selecting from an array or reshaping one, numpy tries to return a view rather than a copy. But most (all?) examples of .view involve some sort of transformation, such as a change of dtype. Can you point to a code or documentation example that uses a bare .view()?

I don't have much experience with h5py, but its documentation describes it as providing a thin, ndarray-like wrapper around HDF5 file objects. Your Dataset is not an ndarray; for example, it lacks many ndarray methods, including view.
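A quick check confirms this (assuming the source_dset from the question):

>>> import numpy as np
>>> isinstance(source_dset, np.ndarray)   # a Dataset is not an ndarray
False
>>> hasattr(source_dset, 'view')          # and it has no view method
False
>>> hasattr(np.zeros(10), 'view')
True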

But indexing a Dataset returns an ndarray, e.g. source_dset[:]. So does .value. The first part of its documentation (via source_dset.value?? in IPython):

Type:            property
String form:     <property object at 0xb4ee37d4>
Docstring:       Alias for dataset[()] 
...

Note that when you assign new values to the Dataset, you have to index source_dset directly. Assigning through .value only modifies the in-memory array it returns; it doesn't change the file object.
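In other words (a small illustration reusing source_dset): reading pulls a copy out, and only direct assignment to the dataset writes back to the file.

>>> arr = source_dset[...]    # reads the whole dataset into a new ndarray
>>> arr[0] = -1.0             # changes only the in-memory copy
>>> source_dset[0] == -1.0    # the file object is untouched
False
>>> source_dset[0] = -1.0     # indexing the dataset directly writes through
>>> source_dset[0] == -1.0
True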

And creating a dataset from an array does not link them any tighter:

x = np.arange(10)
xdset = file.create_dataset('x', data=x)
x1 = xdset[:]

x, xdset and x1 are all independent - changing one does not change the others.
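A quick check of that independence:

x[0] = 99
xdset[0]   # still 0: the dataset was written from a copy of x
x1[0]      # still 0: x1 is yet another copy, read back from the file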

Regarding speed, compare the following timings:

timeit np.sum(x)  #  11.7 µs
timeit np.sum(xdset) # 203 µs
timeit xdset.value #  173 µs
timeit np.sum(x1)  # same as for x

Summing an array is much faster than summing a dataset; most of the extra time goes into creating an array from the dataset.
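So if a component needs to crunch the same data repeatedly, the pattern these timings suggest is to pay the read cost once and then work on the in-memory copy (a sketch, not from the original answer):

x1 = xdset[:]        # read from the file once (~173 µs above)
total = np.sum(x1)   # every later operation runs at ndarray speed
mean = np.mean(x1)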



Source: https://stackoverflow.com/questions/27409512/how-to-share-memory-from-an-hdf5-dataset-with-a-numpy-ndarray
