Question
I am writing an application for streaming data from a sensor, and then processing the data in various ways. These processing components include visualizing the data, some number crunching (linear algebra), and also writing the data to disk in an HDF5 format. Ideally each of these components will be its own module, all run in the same Python process so that IPC is not an issue. This leads me to the question of how to efficiently store the streaming data.
The datasets are quite large (~5 GB), so I would like to minimize the number of copies of the data in memory by sharing it between the components that need access. If all components used plain ndarrays, this should be straightforward: give one of the components the data, then give everyone else a view using ndarray.view().
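As a minimal sketch of that pattern (the variable names here are hypothetical stand-ins for the components):
>>> import numpy as np
>>> data = np.zeros((1000,))    # held by the acquisition component
>>> plot_buf = data.view()      # handle for the visualizer, no copy
>>> calc_buf = data.view()      # handle for the number cruncher, no copy
>>> data[0] = 1
>>> plot_buf[0] == 1 and calc_buf[0] == 1
True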
However, the component writing data to disk stores the data in an HDF5 Dataset. These are interoperable with ndarrays in many ways, but creating a view() does not appear to work as it does with ndarrays.
Observe with ndarrays:
>>> source = np.zeros((10,))
>>> view = source.view()
>>> source[0] = 1
>>> view[0] == 1
True
>>> view.base is source
True
However, this doesn't work with HDF5 Datasets:
>>> import h5py
>>> file = h5py.File('source.h5', 'a')
>>> source_dset = file.create_dataset('source', (10,), dtype=np.float64)
>>> view_dset = source_dset.value.view()
>>> source_dset[0] = 1
>>> view_dset[0] == 1
False
>>> view_dset.base is source_dset.value
False
It also doesn't work to just assign the Dataset.value itself, rather than a view of it.
>>> view_dset = source_dset.value
>>> source_dset[0] = 2
>>> view_dset[0] == 2
False
>>> view_dset.base is source_dset.value
False
So my question is this: is there a way to have an ndarray share memory with an HDF5 Dataset, just as two ndarrays can share memory?
My guess is that this is unlikely to work, probably because of some subtlety in how HDF5 stores arrays in memory. But it is a bit confusing to me, especially since type(source_dset.value) == numpy.ndarray and the OWNDATA flag of Dataset.value.view() is actually False. Who owns the memory that the view is interpreting?
Version details: Python 3, NumPy version 1.9.1, h5py version 2.3.1, HDF5 version 1.8.13, Linux.
Other details: HDF5 file is chunked.
EDIT:
After playing around with this a bit more, it seems that one possible solution is to give the other components a reference to the HDF5 Dataset itself. This doesn't seem to copy any memory (at least not according to top), and changes in the source Dataset are visible through the reference.
>>> import h5py
>>> file = h5py.File('source.h5', 'a')
>>> source = file.create_dataset('source', (10,), dtype=np.float64)
>>> class Container():
... def __init__(self, source_dset):
... self.dset = source_dset
...
>>> container = Container(source)
>>> source[0] = 1
>>> container.dset[0] == 1
True
I'm reasonably happy with this solution (as long as the memory savings pan out), but I'm still curious why the view approach above doesn't work.
Answer 1:
The short answer is that you can't share memory between a numpy array and an h5py dataset. While they have similar APIs (at least when it comes to indexing), they don't have compatible memory layouts. In fact, apart from some sort of cache, the dataset isn't even in memory - it lives in the file.
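A quick way to convince yourself of this (the file name here is arbitrary): every indexing read pulls a fresh array out of the file, and the resulting arrays do not share a buffer.
import numpy as np
import h5py

f = h5py.File('demo.h5', 'w')   # throwaway file; the name is just an example
dset = f.create_dataset('d', (10,), dtype=np.float64)
a = dset[:]   # one read: allocates a new ndarray from the file
b = dset[:]   # a second, independent read
a is b                       # False
np.may_share_memory(a, b)    # False - separate buffers each time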
First, I don't see why you need to use source.view() with a numpy array. Yes, when selecting from an array or reshaping one, numpy tries to return a view rather than a copy. But most (all?) examples of .view involve some sort of transformation, such as a change of dtype. Can you point to a code or documentation example that uses a bare .view()?
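For contrast, here is the sort of .view use the documentation examples tend to show - reinterpreting the same buffer with a different dtype:
import numpy as np

x = np.arange(4, dtype=np.int32)
y = x.view(np.uint8)   # same buffer, reinterpreted as raw bytes
x[0] = 255
y[0]                   # 255 on a little-endian machine - x and y share memory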
I don't have much experience with h5py, but its documentation describes it as providing a thin, ndarray-like wrapper around HDF5 file objects. Your Dataset is not an ndarray. For example, it lacks many of the ndarray methods, including view.
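You can check this directly with the source_dset from the question:
hasattr(source_dset, 'view')     # False - Dataset has no view method
hasattr(source_dset[:], 'view')  # True - the indexed result is an ndarray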
But indexing a Dataset returns an ndarray, e.g. source_dset[:]. So does .value. The first part of its documentation (via source_dset.value?? in IPython):
Type: property
String form: <property object at 0xb4ee37d4>
Docstring: Alias for dataset[()]
...
Note that when you assign new values to the Dataset, you have to index source_dset directly. Indexing the .value does not work - or rather, it modifies only the returned in-memory array, not the file object.
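Concretely, with the source_dset from the question:
source_dset[0] = 5         # writes through to the file
tmp = source_dset.value    # a fresh in-memory copy
tmp[0] = 99                # changes only the copy
source_dset[0]             # still 5.0 - the file is untouched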
And creating a dataset from an array does not link them any more tightly:
x = np.arange(10)
xdset = file.create_dataset('x', data=x)
x1 = xdset[:]
x, xdset and x1 are all independent - changing one does not change the others.
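For example, continuing the snippet above:
x[0] = 100
x[0], xdset[0], x1[0]    # (100, 0, 0) - only x changed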
Regarding timing, compare:
timeit np.sum(x) # 11.7 µs
timeit np.sum(xdset) # 203 µs
timeit xdset.value # 173 µs
timeit np.sum(x1) # same as for x
The sum of an array is much faster than the sum of a dataset. Most of the extra time goes into creating an array from the dataset.
Source: https://stackoverflow.com/questions/27409512/how-to-share-memory-from-an-hdf5-dataset-with-a-numpy-ndarray