Can I store my own class object into hdf5?

↘锁芯ラ 提交于 2020-01-11 07:17:13

问题


I have a class like this:

class C:
     def __init__(self, id, user_id, photo):
         self.id = id
         self.user_id = user_id
         self.photo = photo

I need to create millions of these objects. id is an integer as well as user_id but photo is a bool array of size 64. My boss wants me to store all of them inside hdf5 files. I also need to be able to make queries according to their user_id attributes to get all of the photos that have the same user_id. Firstly, how do I store them? Or even can I? And secondly, once I store(if I can) them how do I query them? Thank you.


回答1:


Although you can store the whole data structure in a single HDF5 table, it is probably much easier to store the described class as three separate variables - two 1D arrays of integers and a data structure for storing your 'photo' attribute.

If you care about file size and speed and do not care about human-readability of your files, you can model your 64 bool values either as 8 1D arrays of UINT8 or a 2D array N x 8 of UINT8 (or CHARs). Then, you can implement a simple interface that would pack your bool values into bits of UINT8 and back (e.g., How to convert a boolean array to an int array)

As far as know, there are no built-in search functions in HDF5, but you can read in the variable containing user_ids and then simply use Python to find indexes of all elements matching your user_id.

Once you have the indexes, you can read in the relevant slices of your other variables. HDF5 natively supports efficient slicing, but it works on ranges, so you might want to think how to store records with the same user_id in continuous chunks, see discussion over here

h5py: Correct way to slice array datasets

You might also want to look into pytables - a python interace that builds over hdf5 to store data in table-like strucutres.

import numpy as np
import h5py


class C:
    def __init__(self, id, user_id, photo):
        self.id = id
        self.user_id = user_id
        self.photo = photo

def write_records(records, file_out):

    f = h5py.File(file_out, "w")

    dset_id = f.create_dataset("id", (1000000,), dtype='i')
    dset_user_id = f.create_dataset("user_id", (1000000,), dtype='i')
    dset_photo = f.create_dataset("photo", (1000000,8), dtype='u8')
    dset_id[0:len(records)] = [r.id for r in records]
    dset_user_id[0:len(records)] = [r.user_id for r in records]
    dset_photo[0:len(records)] = [np.packbits(np.array(r.photo, dtype='bool').astype(int)) for r in records]
    f.close()

def read_records_by_id(file_in, record_id):
    f = h5py.File(file_in, "r")
    dset_id = f["id"]
    data = dset_id[0:2]
    res = []
    for idx in np.where(data == record_id)[0]:
        record = C(f["id"][idx:idx+1][0], f["user_id"][idx:idx+1][0], np.unpackbits( np.array(f["photo"][idx:idx+1][0],  dtype='uint8') ).astype(bool))
        res.append(record)
    return res 

m = [ True, False,  True,  True, False,  True,  True,  True]
m = m+m+m+m+m+m+m+m
records = [C(1, 3, m), C(34, 53, m)]

# Write records to file
write_records(records, "mytestfile.h5")

# Read record from file
res = read_records_by_id("mytestfile.h5", 34)

print res[0].id
print res[0].user_id
print res[0].photo


来源:https://stackoverflow.com/questions/39124934/can-i-store-my-own-class-object-into-hdf5

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!