HDF5 core driver (H5FD_CORE): loading selected dataset(s)

心已入冬 submitted on 2019-12-13 03:16:51

Question


Currently, I load HDF5 data in Python via h5py and read a dataset into memory:

f = h5py.File('myfile.h5', 'r')
dset = f['mydataset'][:]

This works, but if 'mydataset' is the only dataset in myfile.h5, then the following is more efficient:

f = h5py.File('myfile.h5', 'r', driver='core')
dset = f['mydataset'][:]

I believe this is because the 'core' driver memory-maps the entire file, which is an optimised way of loading data into memory.

My question is: is it possible to use 'core' driver on selected dataset(s)? In other words, on loading the file I only wish to memory map selected datasets and/or groups. I have a file with many datasets and I would like to load each one into memory sequentially. I cannot load them all at once, since on aggregate they won't fit in memory.
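
For reference, the sequential pattern I have in mind looks like this (a minimal sketch; process is a placeholder for whatever I actually do with each array):

import h5py

with h5py.File('myfile.h5', 'r') as f:
    for name in f:         # iterate over the top-level members (all datasets here)
        data = f[name][:]  # load one dataset fully into memory
        process(data)      # placeholder for the real work
        del data           # release it before loading the next one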

I understand one alternative is to split my single HDF5 file with many datasets into many HDF5 files with one dataset each. However, I am hoping there might be a more elegant solution, possibly using h5py low-level API.
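
For what it is worth, the low-level equivalent of driver='core' looks roughly like this (a sketch only, and it still maps the whole file rather than selected datasets):

import h5py

# Build a file-access property list that selects the core driver;
# this is what the high-level driver='core' option does under the hood.
fapl = h5py.h5p.create(h5py.h5p.FILE_ACCESS)
fapl.set_fapl_core(block_size=1024**2, backing_store=False)

fid = h5py.h5f.open(b'myfile.h5', h5py.h5f.ACC_RDONLY, fapl=fapl)
f = h5py.File(fid)
dset = f['mydataset'][:]
f.close()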

Update: Even if what I am asking is not possible, can someone explain why using driver='core' has substantially better performance when reading in a whole dataset? Is reading the only dataset of an HDF5 file into memory very different from memory mapping it via core driver?


Answer 1:


I suspect it is the same problem as when you read the file by looping over an arbitrary axis without setting a proper chunk cache size.

If you read it with the core driver, the whole file is guaranteed to be read sequentially from disk, and everything else (decompression, rearranging chunked data into a contiguous layout, ...) is done entirely in RAM.

I used the simplest fancy-slicing example from https://stackoverflow.com/a/48405220/4045774 to write the data.

import time

import h5py as h5
import h5py_cache as h5c  # third-party wrapper exposing the chunk-cache size


def reading():
    file_name_hdf5 = 'Test.h5'

    # Core driver: the whole file is read from disk in one sequential pass
    t1 = time.time()
    f = h5.File(file_name_hdf5, 'r', driver='core')
    dset = f['Test'][:]
    f.close()
    print(time.time() - t1)

    # Default driver with a 500 MB chunk cache
    t1 = time.time()
    f = h5c.File(file_name_hdf5, 'r', chunk_cache_mem_size=1024**2 * 500)
    dset = f['Test'][:]
    f.close()
    print(time.time() - t1)

    # Default driver with the default 1 MB chunk cache
    t1 = time.time()
    f = h5.File(file_name_hdf5, 'r')
    dset = f['Test'][:]
    print(time.time() - t1)
    f.close()


if __name__ == "__main__":
    reading()

On my machine this gives 2.38 s (core driver), 2.29 s (500 MB chunk cache) and 4.29 s (the default 1 MB chunk cache).
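
As a side note, since h5py 2.9 the chunk cache can be set directly on h5py.File, so h5py_cache is no longer required. A minimal equivalent of the 500 MB case above:

import h5py

# Same 500 MB chunk cache as the h5py_cache call, using plain h5py (>= 2.9)
with h5py.File('Test.h5', 'r', rdcc_nbytes=500 * 1024**2) as f:
    dset = f['Test'][:]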



Source: https://stackoverflow.com/questions/48413103/hdf5-core-driver-h5fd-core-loading-selected-datasets
