How do I lazily concatenate “numpy ndarray”-like objects for sequential reading?

Submitted anonymously (unverified) on 2019-12-03 10:24:21

Question:

I have a list of several large HDF5 files, each with a 4D dataset. I would like to obtain a concatenation of them on the first axis, that is, an array-like object that behaves as if all the datasets were concatenated. My final intent is to sequentially read chunks of the data along that same axis (e.g. [0:100,:,:,:], [100:200,:,:,:], ...), multiple times.

Datasets in h5py share a significant part of the numpy array API, which allows me to call numpy.concatenate to get the job done:

files = [h5.File(name, 'r') for name in filenames]
X = np.concatenate([f['data'] for f in files], axis=0)

On the other hand, the memory layout is not the same, and memory cannot be shared among them (related question). Alas, concatenate will eagerly copy the entire content of each array-like object into a new array, which I cannot accept in my use case. The source code of the array concatenation function confirms this.

How can I obtain a concatenated view over multiple array-like objects, without eagerly reading them into memory? Slicing and indexing into this view should behave just as if I had a concatenated array.

I can imagine that writing a custom wrapper would work, but I would like to know whether such an implementation already exists as a library, or whether another solution to the problem is available and just as feasible. My searches so far have yielded nothing of this sort. I am also willing to accept solutions specific to h5py.
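A minimal sketch of such a wrapper, assuming the datasets agree on all axes but the first and supporting only contiguous slices along axis 0 (the class name LazyConcat is made up here; integer indices, empty selections, and fancy indexing are not handled):

import numpy as np

class LazyConcat:
    """Read-only view over array-likes concatenated along axis 0."""
    def __init__(self, arrays):
        self.arrays = arrays
        # Start offset of each array along axis 0, plus the total length.
        self.offsets = np.cumsum([0] + [a.shape[0] for a in arrays])

    @property
    def shape(self):
        return (int(self.offsets[-1]),) + self.arrays[0].shape[1:]

    def __getitem__(self, key):
        # Split the key into the first-axis slice and the trailing indices.
        first, rest = (key[0], key[1:]) if isinstance(key, tuple) else (key, ())
        start, stop, step = first.indices(self.shape[0])
        assert step == 1, "only contiguous first-axis slices are supported"
        parts = []
        for a, off in zip(self.arrays, self.offsets):
            lo = max(start - off, 0)
            hi = min(stop - off, a.shape[0])
            if lo < hi:
                # Only this range of this array is read from disk.
                parts.append(a[(slice(lo, hi),) + rest])
        return np.concatenate(parts, axis=0)

Used like this, only the requested rows are ever read:

X = LazyConcat([f['data'] for f in files])
chunk = X[0:100, :, :, :]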

Answer 1:

flist = [f['data'] for f in files] is a list of dataset objects. The actual data lives in the HDF5 files and is accessible only as long as those files remain open.

When you do

arr = np.concatenate(flist, axis=0)

I imagine concatenate first does

temp = [np.asarray(a) for a in flist]

that is, it constructs a list of numpy arrays. I assume np.asarray(f['data']) is the same as f['data'].value or f['data'][:] (as I discussed two years ago in the linked SO question; note that newer h5py versions deprecate .value in favor of ds[()] or ds[:]). I should do some time tests comparing that with

arr = np.concatenate([a.value for a in flist], axis=0)
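A quick sketch of such a timing comparison (here a[:] stands in for the deprecated a.value; the numbers will depend heavily on dataset size and HDF5 chunking):

import timeit

# Let concatenate call np.asarray on each dataset itself ...
t1 = timeit.timeit(lambda: np.concatenate(flist, axis=0), number=3)
# ... versus loading each dataset into memory explicitly first.
t2 = timeit.timeit(lambda: np.concatenate([a[:] for a in flist], axis=0), number=3)
print(t1, t2)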

flist is a kind of lazy compilation of these datasets, in that the data still resides in the files and is accessed only when you do something more with it.

[a[:, :, :10] for a in flist]

would load only a portion of each of those datasets into memory; I expect that a concatenate on that list would be the equivalent of arr[:,:,:10].
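A sketch checking that equivalence (assuming arr has already been built with the full concatenate):

# Partial loads concatenated along axis 0 ...
partial = np.concatenate([a[:, :, :10] for a in flist], axis=0)
# ... should equal the same slice of the eagerly concatenated array.
assert np.array_equal(partial, arr[:, :, :10])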

Generators or generator comprehensions are a form of lazy evaluation, but I think they have to be turned into lists before use in concatenate. In any case, the result of concatenate is always an array with all the data in a contiguous block of memory; it never refers to blocks of data residing in files.

You need to tell us more about what you intend to do with this large concatenated array of datasets. As outlined above, I think you can construct arrays that contain slices of all the datasets. You could also perform other actions, as I demonstrated in the previous answer, but with an access-time cost.
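Since the question's use case is sequential chunked reads along the first axis, here is a sketch that does that with no wrapper at all. Note that a chunk never spans a file boundary here, so the last chunk from each file may be short (chunk=100 is just the figure from the question):

def iter_chunks(datasets, chunk=100):
    # Walk the datasets in order, yielding successive blocks along axis 0;
    # only one block's worth of data is in memory at a time.
    for ds in datasets:
        for start in range(0, ds.shape[0], chunk):
            yield ds[start:start + chunk]

for block in iter_chunks(flist):
    ...  # process block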


