h5py: Correct way to slice array datasets

Asked by 北恋 on 2020-12-24 08:14 · backend · unresolved · 3 answers · 1161 views

I'm a bit confused here:

As far as I have understood, h5py's .value method reads an entire dataset and dumps it into an array, which is slow and discouraged. What, then, is the correct way to slice array datasets?

3 Answers
  • 2020-12-24 08:55

    For fast slicing with h5py, stick to the "plain-vanilla" slice notation:

    file['test'][0:300000]
    

    or, for example, reading every other element:

    file['test'][0:300000:2]
    

    Simple slicing (slice objects and single integer indices) should be very fast, as it translates directly into HDF5 hyperslab selections.

    The expression file['test'][range(300000)] invokes h5py's version of "fancy indexing", namely, indexing via an explicit list of indices. There's no native way to do this in HDF5, so h5py implements a (slower) method in Python, which unfortunately has abysmal performance when the lists are > 1000 elements. Likewise for file['test'][np.arange(300000)], which is interpreted in the same way.
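    One way to see the difference for yourself is to time both access patterns. Below is a minimal sketch (the file and dataset names are made up for the demo; adjust them to your own data):

    import timeit

    import h5py
    import numpy as np

    # Build a throwaway file for the demo (hypothetical names).
    with h5py.File("demo.h5", "w") as f:
        f.create_dataset("test", data=np.arange(1_000_000, dtype="f8"))

    with h5py.File("demo.h5", "r") as f:
        ds = f["test"]
        # Plain slice: translated into a single HDF5 hyperslab selection.
        t_slice = timeit.timeit(lambda: ds[0:300000], number=5)
        # Explicit index list: h5py's fancy indexing, resolved in Python.
        t_fancy = timeit.timeit(lambda: ds[np.arange(300000)], number=5)
        print(f"slice: {t_slice:.3f} s   index list: {t_fancy:.3f} s")

    The gap widens quickly as the index list grows, which matches the >1000-element behavior described above.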

    See also:

    [1] http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing

    [2] https://github.com/h5py/h5py/issues/293

  • 2020-12-24 09:03

    Based on the title of your post, the 'correct' way to slice array datasets is to use the built-in slice notation.

    All of your examples are equivalent to file["test"][:], since the bare [:] selects every element in the array.

    More information about slicing notation can be found here: http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html

    I use HDF5 + Python often and have never needed the .value method. Note that myarr = file["test"] does not copy anything by itself; it only gives you a handle to the dataset on disk. The data is copied into a numpy array when you slice it, e.g. myarr = file["test"][:].
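    For concreteness, a small sketch of the common patterns (again with made-up file/dataset names); each slice returns an ordinary numpy array, while the unsliced handle stays a lazy dataset:

    import h5py

    # Hypothetical file/dataset names for illustration.
    with h5py.File("demo.h5", "r") as f:
        ds = f["test"]     # lazy handle, nothing read from disk yet
        whole = ds[:]      # read the entire dataset into memory
        head = ds[0:100]   # read only the first 100 elements
        strided = ds[::2]  # read every other element
        print(type(ds))    # h5py Dataset
        print(type(whole)) # numpy.ndarray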

  • 2020-12-24 09:08

    The .value method is copying the data to memory as a numpy array. Try comparing type(file["test"]) with type(file["test"].value): the former should be an HDF5 dataset, the latter a numpy array.

    I'm not familiar enough with the h5py or HDF5 internals to tell you exactly why certain dataset operations are slow; but the reason those two are different is that in one case you're slicing a numpy array in memory, and in the other slicing an HDF5 dataset from disk.
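    A quick way to check this yourself (hypothetical file name). Note that .value was deprecated and later removed in h5py 3.0; ds[()] is the equivalent way to read a whole dataset:

    import h5py

    # Hypothetical file name. Dataset.value is gone in h5py >= 3.0;
    # ds[()] reads the full dataset into memory instead.
    with h5py.File("demo.h5", "r") as f:
        ds = f["test"]
        arr = ds[()]
        print(type(ds))   # an on-disk dataset handle (h5py.Dataset)
        print(type(arr))  # an in-memory copy (numpy.ndarray)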
