Question
I need to merge a number of datasets, each contained in a separate file, into another dataset belonging to a final file. The order of the data in the partial datasets is not preserved when they are copied into the final one: the data in the partial datasets is 'mapped' into the final one through indices. I created two lists, final_indices and partial_indices, and wrote:
final_dataset = final_hdf5file['dataset']
partial_dataset = partial_hdf5file['dataset']
# here partial_indices and final_indices are lists.
final_dataset[final_indices] = partial_dataset[partial_indices]
The problem with this is that the performance is quite bad, and the reason is that final_indices and partial_indices both have to be lists. My workaround has been to create two NumPy arrays from the final and partial datasets, and to use NumPy arrays as indices.
final_array = np.array(final_dataset)
partial_array = np.array(partial_dataset)
# here partial_indices and final_indices are NumPy ndarrays.
final_array[final_indices] = partial_array[partial_indices]
The final array is then re-written to the final dataset.
final_dataset[...] = final_array
However, it seems to me rather inelegant to do so.
Is it possible to use NumPy arrays as indices in an h5py dataset?
Answer 1:
So you are doing fancy-indexing for both the read and write:
http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing
The documentation warns that fancy indexing can be slow with long lists.
I can see how reading and writing the whole datasets, and doing the mapping on in-memory arrays, would be faster, though I haven't actually tested that. Both the read/write and the mapping should be faster that way:
http://docs.h5py.org/en/latest/high/dataset.html#reading-writing-data
I would use slice notation (or the .value attribute, which is deprecated and was removed in h5py 3.0) to load the datasets, but that's a minor point.
final_array = final_dataset[:]
Hide the code in a function if it looks inelegant.
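A minimal sketch of such a function follows. The name merge_by_indices, the in-memory demo files, and the sample values are all made up for illustration; only the read / map / write-back pattern comes from the question. It assumes both datasets fit in memory.

```python
import numpy as np
import h5py


def merge_by_indices(final_dataset, partial_dataset, final_indices, partial_indices):
    """Copy partial_dataset[partial_indices] into final_dataset[final_indices]."""
    # Read both datasets fully into memory as NumPy arrays.
    final_array = final_dataset[...]
    partial_array = partial_dataset[...]
    # Do the index mapping in memory, where NumPy fancy indexing
    # accepts unsorted index arrays.
    final_array[np.asarray(final_indices)] = partial_array[np.asarray(partial_indices)]
    # Write the updated array back to the dataset in a single operation.
    final_dataset[...] = final_array


# Demo with in-memory HDF5 files (hypothetical names, nothing is written to disk).
with h5py.File("final.h5", "w", driver="core", backing_store=False) as final_file, \
     h5py.File("partial.h5", "w", driver="core", backing_store=False) as partial_file:
    final_dataset = final_file.create_dataset("dataset", data=np.zeros(6))
    partial_dataset = partial_file.create_dataset(
        "dataset", data=np.array([10.0, 20.0, 30.0]))

    merge_by_indices(final_dataset, partial_dataset,
                     final_indices=np.array([4, 0, 2]),
                     partial_indices=np.array([2, 0, 1]))
    result = final_dataset[...]

print(result)
```

With several partial files, it would probably be cheaper to read the final dataset once, apply all the mappings to the array, and write it back once at the end, rather than calling the function per file.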
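This one-liner might work (I haven't tested it). The RHS is more likely to work than the LHS: final_dataset[:] returns an in-memory copy, so assigning into it would probably not write anything back to the file.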
final_dataset[:][final_indices] = partial_dataset[:][partial_indices]
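A small check of the left-hand side of that one-liner, as a sketch: the file name, dataset contents, and index values below are invented for the demo. It relies on slicing an h5py dataset returning a new NumPy array rather than a view into the file.

```python
import numpy as np
import h5py

# In-memory HDF5 file (hypothetical name, nothing is written to disk).
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    final_dataset = f.create_dataset("dataset", data=np.zeros(4))
    values = np.array([1.0, 2.0])
    idx = np.array([3, 1])

    # final_dataset[:] builds a new in-memory array; the assignment changes
    # only that temporary copy, which is then discarded.
    final_dataset[:][idx] = values
    after_oneliner = final_dataset[:]

    # Keeping a reference to the array and writing it back updates the dataset.
    final_array = final_dataset[:]
    final_array[idx] = values
    final_dataset[...] = final_array
    after_writeback = final_dataset[:]

print(after_oneliner)
print(after_writeback)
```

So the explicit write-back step from the question (final_dataset[...] = final_array) is still needed; the one-liner only saves typing on the read side.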
Source: https://stackoverflow.com/questions/47888392/is-it-possible-to-use-np-arrays-as-indices-in-h5py-datasets