I have a number of hdf5 files, each of which has a single dataset. The datasets are too large to hold in RAM. I would like to combine these files into a single file containing all the datasets.
I usually use IPython and the h5copy tool together; this is much faster than a pure Python solution (a pure-Python sketch follows the example below for comparison). Once h5copy is installed:
# PLEASE NOTE: THIS IS IPYTHON CONSOLE CODE, NOT PURE PYTHON
import h5py
# for every dataset Dn.h5 you want to merge into Output.h5
f = h5py.File('D1.h5', 'r')   # file to be merged; read-only access is enough
h5_keys = list(f.keys())      # materialize the keys (drop any you don't want)
f.close()                     # close the file before shelling out
for i in h5_keys:
    !h5copy -i 'D1.h5' -o 'Output.h5' -s {i} -d {i}
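For comparison, here is a pure-Python sketch of the same copy using only h5py (same file names as above). Group.copy goes through HDF5's object-copy routine, so the datasets are not read into RAM, but per the note above h5copy tends to be faster:

import h5py

# pure-Python alternative to the h5copy loop above
with h5py.File('D1.h5', 'r') as src, h5py.File('Output.h5', 'a') as dst:
    for key in src.keys():
        src.copy(key, dst)  # object-level copy straight into the output file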
To fully automate the process, assuming you are working in the folder where the files to be merged are stored:
import os
d_names = [n for n in os.listdir(os.getcwd()) if n.endswith('.h5')]  # only HDF5 files
d_struct = {}  # here we will store the structure of every file
for i in d_names:
    f = h5py.File(i, 'r')
    d_struct[i] = list(f.keys())  # materialize the keys before closing
    f.close()
# A) copy every dataset into the root of output.h5 (dataset names must not collide)
for i in d_names:
    for j in d_struct[i]:
        !h5copy -i '{i}' -o 'output.h5' -s {j} -d {j}
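As a quick sanity check from the same session, you can list what ended up in the output file with h5ls, which ships with the same HDF5 command-line tools as h5copy:

!h5ls output.h5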
If you want to keep each input file's datasets in a separate group inside output.h5, you have to create the parent group first using the -p flag:
# B) create a new group in output.h5 for every input .h5 file
for i in d_names:
    dataset = d_struct[i][0]
    newgroup = '%s/%s' % (i[:-3], dataset)  # group named after the file, minus '.h5'
    !h5copy -i '{i}' -o 'output.h5' -s {dataset} -d {newgroup} -p  # -p creates the parent group
    for j in d_struct[i][1:]:
        newgroup = '%s/%s' % (i[:-3], j)    # the group now exists, so no -p needed
        !h5copy -i '{i}' -o 'output.h5' -s {j} -d {newgroup}
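To confirm the resulting layout, a short h5py check also works: visit calls the given function with the path of every object in the file, so each per-file group and its datasets are printed.

import h5py

with h5py.File('output.h5', 'r') as f:
    f.visit(print)  # prints one path per group/dataset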