How to concatenate two numpy arrays in hdf5 format?

夙愿已清 posted on 2019-12-09 20:53:32

Question


I have two numpy arrays stored in HDF5 files that are 44 GB each. I need to concatenate them, but I have to do it on disk because I only have 8 GB of RAM. How would I do this?

Thank you!


Answer 1:


The related post is about obtaining distinct datasets in the resulting file. Concatenating is possible in Python, but you will need to read and write the datasets in multiple operations: read, say, 1 GB from file 1, write it to the output file, and repeat until all data from file 1 has been read, then do the same for file 2. You need to declare the dataset in the output file with its final size up front:

d = f.create_dataset('name_of_dataset', shape=shape, dtype=dtype, data=None)

where shape is computed from the two input datasets and dtype matches theirs.

To write block i of N rows to d: d[i*N:(i+1)*N] = d_from_file_1[i*N:(i+1)*N]

This should load the datasets only partially into memory.
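
The steps above can be put together roughly as follows. This is only a minimal sketch, not the answerer's code: the function name concat_to_new_file, its parameters, and the default block size are made up for illustration, and it assumes both datasets have the same name, the same dtype, and the same shape in every dimension except the first.

```python
import h5py
import numpy as np


def concat_to_new_file(path_a, path_b, path_out, name, rows_per_block=1024):
    """Copy dataset `name` from two files into one pre-sized dataset, block by block.

    Hypothetical helper: names and the default block size are illustrative.
    """
    with h5py.File(path_a, "r") as fa, h5py.File(path_b, "r") as fb, \
            h5py.File(path_out, "w") as fout:
        da, db = fa[name], fb[name]
        # Assumption: same dtype and same trailing dimensions in both inputs.
        if da.shape[1:] != db.shape[1:] or da.dtype != db.dtype:
            raise ValueError("datasets cannot be concatenated along axis 0")

        # Declare the output dataset with its final size up front.
        shape = (da.shape[0] + db.shape[0],) + da.shape[1:]
        d = fout.create_dataset(name, shape=shape, dtype=da.dtype)

        offset = 0  # first output row for the current source dataset
        for src in (da, db):
            n = src.shape[0]
            for start in range(0, n, rows_per_block):
                stop = min(start + rows_per_block, n)  # last block may be short
                # Only rows start..stop are held in memory at any time.
                d[offset + start:offset + stop] = src[start:stop]
            offset += n
```

Pick rows_per_block so that one block (rows_per_block times the size of one row) fits comfortably in RAM.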




Answer 2:


The dataset you want to extend must be extendable: it needs at least one unlimited dimension and a reasonable chunk size. You can then easily append data to it, and the HDF5 format is well suited to this task. If appending does not work, you probably just need to create a new file, which should not be a problem. The following example creates two files and then merges the data from the second file into the first. Tested with files > 80 GB; memory use is not a problem.

import h5py
import numpy as np

ini_dim1 = 100000
ini_dim2 = 1000

counter = ini_dim1 // 10            # number of blocks to write
dim_extend = ini_dim1 // counter    # rows per block

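def create_random_dataset(name, dim1, dim2):
    # Create an extendable dataset (unlimited in both dimensions) and fill it block by block.
    # Assumes counter * dim_extend == dim1, as with the values defined above.
    with h5py.File(name, 'w') as ff1:
        dset = ff1.create_dataset('test_var', (dim1, dim2), maxshape=(None, None), chunks=(10, 10))
        for i in range(counter):
            dset[i*dim_extend:(i+1)*dim_extend, :] = np.random.random((dim_extend, dim2))
            ff1.flush()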

create_random_dataset('test1.h5', ini_dim1, ini_dim2)
create_random_dataset('test2.h5', ini_dim1, ini_dim2)

## append second to first
ff3 = h5py.File('test2.h5','r')
ff4 = h5py.File('test1.h5','a')
print(ff3['test_var'])
print(ff4['test_var'])
ff4['test_var'].resize((ini_dim1*2,ini_dim2))
print(ff4['test_var'])

for i in range(counter):
    ff4['test_var'][ini_dim1+i*dim_extend:ini_dim1 + (i+1)*dim_extend,:] = ff3['test_var'][i*dim_extend:(i+1)*dim_extend,:]
    ff4.flush()
ff3.close()
ff4.close()
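
The append step can also be wrapped in a reusable function that copes with row counts that are not a multiple of the block size. This is a sketch only: append_rows and its parameters are hypothetical names, and it assumes the destination dataset was created with a maxshape that allows growth along the first axis, as in the example above.

```python
import h5py


def append_rows(src_path, dst_path, name, rows_per_block=1024):
    """Append dataset `name` from src_path to the same-named dataset in dst_path.

    Hypothetical helper: names and the default block size are illustrative.
    """
    with h5py.File(src_path, "r") as fsrc, h5py.File(dst_path, "a") as fdst:
        src, dst = fsrc[name], fdst[name]
        old_rows = dst.shape[0]
        new_rows = old_rows + src.shape[0]

        # A dataset can only grow up to its maxshape (None means unlimited).
        limit = dst.maxshape[0]
        if limit is not None and limit < new_rows:
            raise ValueError("destination dataset cannot grow along axis 0")

        # Grow the destination once, then copy block by block.
        dst.resize(new_rows, axis=0)
        for start in range(0, src.shape[0], rows_per_block):
            stop = min(start + rows_per_block, src.shape[0])  # short last block
            dst[old_rows + start:old_rows + stop] = src[start:stop]
```

With the files created above, the call would be append_rows('test2.h5', 'test1.h5', 'test_var').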


Source: https://stackoverflow.com/questions/43929420/how-to-concatenate-two-numpy-arrays-in-hdf5-format
