Question
Currently, I load HDF5 data in Python via h5py and read a dataset into memory:
f = h5py.File('myfile.h5', 'r')
dset = f['mydataset'][:]
This works, but if 'mydataset' is the only dataset in myfile.h5, then the following is more efficient:
f = h5py.File('myfile.h5', 'r', driver='core')
dset = f['mydataset'][:]
I believe this is because the 'core' driver memory maps the entire file, which is an optimised way of loading data into memory.
My question is: is it possible to use the 'core' driver on selected dataset(s)? In other words, on loading the file I only wish to memory map selected datasets and/or groups. I have a file with many datasets and I would like to load each one into memory sequentially. I cannot load them all at once, since in aggregate they won't fit in memory.
I understand one alternative is to split my single HDF5 file with many datasets into many HDF5 files with one dataset each. However, I am hoping there might be a more elegant solution, possibly using the h5py low-level API.
Update: Even if what I am asking is not possible, can someone explain why using driver='core' has substantially better performance when reading in a whole dataset? Is reading the only dataset of an HDF5 file into memory very different from memory mapping it via the core driver?
Answer 1:
I suspect this is the same problem you run into when reading a file by looping over an arbitrary axis without setting a suitable chunk cache size.
If you read the file with the core driver, the whole file is guaranteed to be read sequentially from disk, and everything else (decompression, assembling the chunked data into a contiguous array, ...) is done entirely in RAM.
I wrote the test data using the simplest form of the fancy slicing example from https://stackoverflow.com/a/48405220/4045774.
import h5py as h5
import time
import numpy as np
import h5py_cache as h5c

def Reading():
    File_Name_HDF5='Test.h5'

    # 1) core driver: the whole file is read into RAM first
    t1=time.time()
    f = h5.File(File_Name_HDF5, 'r',driver='core')
    dset = f['Test'][:]
    f.close()
    print(time.time()-t1)

    # 2) default driver with a 500 MB chunk cache (via h5py_cache)
    t1=time.time()
    f = h5c.File(File_Name_HDF5, 'r',chunk_cache_mem_size=1024**2*500)
    dset = f['Test'][:]
    f.close()
    print(time.time()-t1)

    # 3) default driver with the default chunk cache size
    t1=time.time()
    f = h5.File(File_Name_HDF5, 'r')
    dset = f['Test'][:]
    print(time.time()-t1)
    f.close()

if __name__ == "__main__":
    Reading()
On my machine this gives 2.38 s (core driver), 2.29 s (with a 500 MB chunk cache), and 4.29 s (with the default chunk cache size of 1 MB).
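Since a larger chunk cache gets close to the core driver without pulling the whole file into RAM, the same idea can probably be applied to reading the datasets one at a time. The following is only a minimal sketch, assuming h5py 2.9 or later, where h5py.File accepts rdcc_nbytes directly (so h5py_cache is not needed). The helper name iter_datasets and the small demo file are hypothetical and not part of the original answer.

```python
import os
import tempfile

import h5py
import numpy as np


def iter_datasets(path, cache_bytes=500 * 1024**2):
    """Yield (name, array) for each top-level dataset, one at a time.

    Hypothetical helper: only the dataset currently being yielded is
    held in memory by this function.
    """
    # rdcc_nbytes is the raw data chunk cache size, applied per dataset
    # (assumption: h5py >= 2.9, where this keyword exists).
    with h5py.File(path, "r", rdcc_nbytes=cache_bytes) as f:
        for name, obj in f.items():
            if isinstance(obj, h5py.Dataset):
                yield name, obj[()]  # read the full dataset into a NumPy array


# Build a small, hypothetical demo file with two chunked datasets.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(demo_path, "w") as f:
    f.create_dataset("a", data=np.arange(1000).reshape(100, 10),
                     chunks=(10, 10), compression="gzip")
    f.create_dataset("b", data=np.ones((50, 4)), chunks=(5, 4))

# Process the datasets sequentially; a small cache is enough for the demo.
shapes = {}
for name, arr in iter_datasets(demo_path, cache_bytes=4 * 1024**2):
    shapes[name] = arr.shape
```

Whether this matches the core driver on a real file depends on the chunk layout and compression, so it is worth timing on your own data as in the benchmark above.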
Source: https://stackoverflow.com/questions/48413103/hdf5-core-driver-h5fd-core-loading-selected-datasets