How to write data to a compound dataset using h5py?

Submitted by 夏天 on 2021-02-10 23:26:54


I know that in C we can easily construct a compound dataset using a struct type and assign data chunk by chunk. I am currently implementing a similar structure in Python with h5py.

import h5py
import numpy as np 

# we create a h5 file 
f = h5py.File("test.h5", "a")  # pass the mode explicitly; h5py >= 3.0 defaults to "r"

# We define a compound datatype using np.dtype
dt_type = np.dtype({"names":["image","feature"],
                    "formats":[('<f4',(4,4)), ('<f4',(10,))]})

# we define our dataset with 5 instances
a = f.create_dataset("test", shape=(5,), dtype=dt_type)

To write data, we can do this:

# the "feature" field is 1D for each row
a["feature"]

The output is:

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)

# Write 1s to data field "feature"
a["feature"] = np.ones((5,10))

array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)

The problem arises when I write the 2D array "image" to the file.

a["image"] = np.ones((5,4,4))

ValueError: When changing to a larger dtype, its size must be a divisor of the total size in bytes of the last axis of the array.

I read the documentation and did some research but, unfortunately, did not find a good solution. I understand that we could use groups and datasets to mimic this compound data, but I really want to keep this structure. Is there a good way to do this?

Any help would be appreciated. Thank you.


You can use PyTables (aka tables) to populate your HDF5 file with the desired arrays. You should think of each row as an independent entry (defined by a dtype). So, the 'image' array is stored as 5 (4x4) ndarrays, not a single (5x4x4) ndarray. The same goes for the 'feature' array.

This example adds each 'feature' and 'image' array one row at a time. Alternatively, you can create a numpy record array holding both fields for multiple rows, then add it with the Table.append() method.

See code below to create the file, then open read only to check the data.

import tables as tb
import numpy as np 

# open h5 file for writing
with tb.File('test1_tb.h5','w') as h5f:

    # define a compound datatype using np.dtype
    dt_type = np.dtype({"names":["feature","image"],
                        "formats":[('<f4',(10,)) , ('<f4',(4,4)) ] })

    # create empty table (dataset)
    a = h5f.create_table('/', "test1", description=dt_type)

    # create dataset row iterator
    a_row = a.row
    # create array data and append to dataset
    for i in range(5):
        a_row['feature'] = i*np.ones(10)
        a_row['image'] = np.random.random(4*4).reshape(4,4)
        a_row.append()  # without this call the row is never added to the table
    a.flush()  # write any buffered rows to the file

# open h5 file read only and print contents
with tb.File('test1_tb.h5','r') as h5fr:
    a = h5fr.get_node('/','test1')
    print (a.coldtypes)
    print ('# of rows:',a.nrows)

    for row in a:
        print (row['feature'])
        print (row['image'])
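
The alternative mentioned above, building a record array for several rows and adding it with Table.append(), might look roughly like this. It is only a sketch under the same dtype as the example; the file name test2_tb.h5 and the node name test2 are made up for illustration.

```python
import numpy as np
import tables as tb

# same compound dtype as in the row-by-row example
dt_type = np.dtype({"names": ["feature", "image"],
                    "formats": [("<f4", (10,)), ("<f4", (4, 4))]})

# build all five rows in memory as one structured (record) array
rows = np.zeros((5,), dtype=dt_type)
rows["feature"] = np.arange(5, dtype="<f4")[:, None] * np.ones(10)
rows["image"] = np.random.random((5, 4, 4))

# hypothetical file and node names
with tb.open_file("test2_tb.h5", "w") as h5f:
    table = h5f.create_table("/", "test2", description=dt_type)
    table.append(rows)      # add all rows in a single call
    table.flush()
    nrows = table.nrows
    stored = table.read()   # read everything back as a structured array
```

Here stored["image"] comes back with one (4x4) array per row, matching the row-oriented layout described above.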


This blogpost has helped me with this issue:

The key code for writing a compound dataset:

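from os import path

import numpy as np
import h5py

# root_dir is assumed to be defined elsewhere as the folder holding the .npy files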

# Load your dataset into numpy
audio = np.load(path.join(root_dir, 'X_dev.npy')).astype(np.float32)
text = np.load(path.join(root_dir, 'T_dev.npy')).astype(np.float32)
gesture = np.load(path.join(root_dir, 'Y_dev.npy')).astype(np.float32)

# open a hdf5 file
hf = h5py.File(root_dir+"/dev.hdf5", 'a') 

# create group
g1 = hf.create_group('dev') 

# put dataset in subgroups
g1.create_dataset('audio', data=audio)
g1.create_dataset('text', data=text)
g1.create_dataset('gesture', data=gesture)

# close the hdf5 file
hf.close()
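
If the compound layout from the question has to be kept, rather than separate datasets in a group, h5py can also write it when whole rows are assigned from a NumPy structured array instead of one field at a time. The following is a minimal sketch, assuming the dtype from the question; the file name test_rows.h5 is made up for illustration.

```python
import h5py
import numpy as np

# compound dtype from the question: a (4x4) image and a 10-element feature per row
dt_type = np.dtype({"names": ["image", "feature"],
                    "formats": [("<f4", (4, 4)), ("<f4", (10,))]})

# fill all five rows in memory; each field keeps its per-row shape
rows = np.zeros((5,), dtype=dt_type)
rows["image"] = np.ones((5, 4, 4))
rows["feature"] = np.ones((5, 10))

# hypothetical file name
with h5py.File("test_rows.h5", "w") as f:
    a = f.create_dataset("test", shape=(5,), dtype=dt_type)
    a[...] = rows            # write complete rows, all fields at once
    image = a["image"]       # read a single field back
    feature = a["feature"]
```

To update only some rows later, the same idea applies to a slice: read a[0:2] into a structured array, change its fields in memory, and assign it back to a[0:2].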

