hdf5

Python (pandas): store a data frame in hdf5 with a multi index

Submitted by ∥☆過路亽.° on 2019-12-09 13:05:12
Question: I need to work with a large, multi-indexed data frame, so I tried to create a data frame to learn how to store it in an HDF5 file. The data frame looks like this (with the multi index in the first two columns):

                         0
    Symbol Date
    C      2014-07-21  4792
    B      2014-07-21  4492
    A      2014-07-21  5681
    B      2014-07-21  8310
    A      2014-07-21  1197
    C      2014-07-21  4722
           2014-07-21  7695
           2014-07-21  1774

I'm using pandas.to_hdf, but it creates a "Fixed format store"; when I try to select the data in a group: store.select(
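A minimal sketch of the usual fix, assuming pandas with PyTables installed (file name, key, and column names here are made up): writing with format='table' instead of the default fixed format produces a queryable store that supports store.select.

    import numpy as np
    import pandas as pd

    # Hypothetical frame with a two-level (Symbol, Date) index
    idx = pd.MultiIndex.from_product(
        [["A", "B", "C"], pd.to_datetime(["2014-07-21", "2014-07-22"])],
        names=["Symbol", "Date"])
    df = pd.DataFrame({"value": np.random.randint(1000, 9000, len(idx))}, index=idx)

    # format='table' writes a queryable PyTables store instead of a fixed one
    df.to_hdf("frame.h5", key="df", mode="w", format="table")

    with pd.HDFStore("frame.h5") as store:
        subset = store.select("df", where="Symbol == 'A'")  # query on an index level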

PyTables read random subset

Submitted by 假装没事ソ on 2019-12-09 11:20:37
Question: Is it possible to read a random subset of rows from HDF5 (via PyTables or, preferably, pandas)? I have a very large dataset with millions of rows, but only need a sample of a few thousand for analysis. And what about reading from a compressed HDF file?

Answer 1: Use HDFStore (docs are here, compression docs are here). Random access via a constructed index is supported in 0.13:

    In [26]: df = DataFrame(np.random.randn(100,2),columns=['A','B'])
    In [27]: df.to_hdf('test.h5','df',mode='w',format='table')
    In
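A sketch of the "constructed index" trick the answer refers to, assuming the table-format store created above and pandas >= 0.13 (sample size and file name are illustrative):

    import numpy as np
    import pandas as pd

    with pd.HDFStore('test.h5') as store:
        nrows = store.get_storer('df').nrows          # row count without loading the data
        rows = np.random.randint(0, nrows, size=10)   # random row positions

    # pass the positions as a constructed index; only those rows are read from disk
    sample = pd.read_hdf('test.h5', 'df', where=pd.Index(rows))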

Storing scipy sparse matrix as HDF5

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-09 10:55:12
Question: I want to compress and store a humongous SciPy matrix in HDF5 format. How do I do this? I've tried the code below:

    a = csr_matrix((dat, (row, col)), shape=(947969, 36039))
    f = h5py.File('foo.h5','w')
    dset = f.create_dataset("init", data=a, dtype=int, compression='gzip')

I get errors like these:

    TypeError: Scalar datasets don't support chunk/filter options
    IOError: Can't prepare for writing data (No appropriate function for conversion path)

I can't convert it to a numpy array as there will be
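The excerpted answer is cut off, but one common workaround (not necessarily the one the answer gives) is to store the CSR components as separate h5py datasets; a hedged sketch:

    import h5py
    from scipy.sparse import csr_matrix

    def save_csr(h5file, name, m):
        # store the three CSR component arrays plus the shape, each gzip-compressed
        g = h5file.create_group(name)
        g.create_dataset('data', data=m.data, compression='gzip')
        g.create_dataset('indices', data=m.indices, compression='gzip')
        g.create_dataset('indptr', data=m.indptr, compression='gzip')
        g.attrs['shape'] = m.shape

    def load_csr(h5file, name):
        g = h5file[name]
        return csr_matrix((g['data'][:], g['indices'][:], g['indptr'][:]),
                          shape=tuple(g.attrs['shape']))

    with h5py.File('foo.h5', 'w') as f:
        save_csr(f, 'init', a)   # 'a' is the csr_matrix from the question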

Reading and writing reference types using hdf5.net

Submitted by 心不动则不痛 on 2019-12-09 09:41:46
Question: I'm using HDF5DotNet to write a generic data logging API, DataLog<T>. The idea is to use reflection to automatically create an H5 compound data type which contains the fields in T. The user can then easily add data to the data log using a write(T[] data) method. In order to automatically create the H5 types, the class or structure must be decorated with [StructLayoutAttribute] and some fields with [MarshalAsAttribute]. Each field is then mapped to an H5 type and added to the H5 compound data

HDFStore.append(string, DataFrame) fails when string column contents are longer than those already there

Submitted by 。_饼干妹妹 on 2019-12-09 09:13:25
Question: I have a pandas DataFrame stored via an HDFStore that essentially holds summary rows about test runs I am doing. Several of the fields in each row contain descriptive strings of variable length. When I do a test run, I create a new DataFrame with a single row in it:

    def export_as_df(self):
        return pd.DataFrame(data=[self._to_dict()], index=[datetime.datetime.now()])

and then call HDFStore.append(string, DataFrame) to add the new row to the existing DataFrame. This works fine, apart from where
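One common remedy (the excerpt is truncated, so this may differ from the accepted answer) is to reserve column widths up front with min_itemsize when appending; a sketch with hypothetical file, key, and column names:

    import pandas as pd

    with pd.HDFStore('runs.h5') as store:
        # reserve enough bytes for the widest strings these columns will ever hold,
        # so later appends with longer values do not fail
        store.append('results', new_row_df,
                     min_itemsize={'description': 200, 'notes': 500})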

Sparse array support in HDF5

Submitted by 一笑奈何 on 2019-12-09 09:07:07
Question: I need to store a 512^3 array on disk in some way, and I'm currently using HDF5. Since the array is sparse, a lot of disk space gets wasted. Does HDF5 provide any support for sparse arrays?

Answer 1: Chunked datasets (H5D_CHUNKED) allow sparse storage, but depending on your data, the overhead may be significant. Take a typical array and try both sparse and non-sparse, then compare the file sizes; you will see whether it is really worth it.

Answer 2: One workaround is to create the dataset with a compression
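A minimal h5py sketch of the chunked-storage approach from Answer 1 (shape, chunk size, and the written region are made up): chunks that are never written take essentially no space on disk, and compression shrinks the rest.

    import numpy as np
    import h5py

    with h5py.File('sparse.h5', 'w') as f:
        dset = f.create_dataset('field', shape=(512, 512, 512), dtype='f4',
                                chunks=(64, 64, 64), compression='gzip')
        # write only the small non-empty region; untouched chunks stay unallocated
        dset[100:110, 200:210, 300:310] = np.random.rand(10, 10, 10)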

Discovering keys using h5py in python3

Submitted by 本秂侑毒 on 2019-12-09 07:59:21
Question: In python2.7, I can inspect an HDF5 file's keys using

    $ python
    >>> import h5py
    >>> f = h5py.File('example.h5', 'r')
    >>> f.keys()
    [u'some_key']

However, in python3.4, I get something different:

    $ python3 -q
    >>> import h5py
    >>> f = h5py.File('example.h5', 'r')
    >>> f.keys()
    KeysViewWithLock(<HDF5 file "example.h5" (mode r)>)

What is KeysViewWithLock, and how can I examine my HDF5 keys in Python3?

Answer 1: From h5py's website (http://docs.h5py.org/en/latest/high/group.html#dict-interface-and-links):
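The answer excerpt is cut off; for reference, a short sketch of the standard Python 3 way to materialize the view:

    import h5py

    with h5py.File('example.h5', 'r') as f:
        print(list(f.keys()))      # turn the KeysViewWithLock into a plain list
        for name in f:             # or simply iterate over the group
            print(name, f[name])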

What is a better approach of storing and querying a big dataset of meteorological data

Submitted by 烂漫一生 on 2019-12-09 06:47:43
Question: I am looking for a convenient way to store and query a huge amount of meteorological data (a few TB). More information about the type of data is in the middle of the question. Previously I was leaning toward MongoDB (I have used it for many of my own previous projects and feel comfortable dealing with it), but recently I found out about the HDF5 data format. Reading about it, I found some similarities with Mongo: HDF5 simplifies the file structure to include only two major types of
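For context, a tiny sketch of the two HDF5 object types the excerpt starts to describe — groups (directory-like) and datasets (array-like); the file layout and names are entirely hypothetical:

    import numpy as np
    import h5py

    with h5py.File('weather.h5', 'w') as f:
        station = f.create_group('stations/BERLIN')           # group: acts like a folder
        station.create_dataset('temperature',                 # dataset: holds the array
                               data=np.random.rand(24 * 365),
                               compression='gzip')
        station.attrs['latitude'] = 52.52                     # metadata as attributes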

write a boost::multi_array to hdf5 dataset

Submitted by 眉间皱痕 on 2019-12-09 04:46:41
Question: Are there any libraries or headers available to make writing C++ vectors or boost::multi_arrays to HDF5 datasets easy? I have looked at the HDF5 C++ examples, and they just use C++ syntax to call C functions, and they only write static C arrays to their datasets (see create.cpp). Am I missing the point!? Many thanks in advance, Adam

Answer 1: Here is how to write N-dimensional multi_arrays in HDF5 format. Here is a short example:

    #include <boost/multi_array.hpp>
    using boost::multi_array;
    using boost:

Experience with using h5py to do analytical work on big data in Python?

Submitted by 风格不统一 on 2019-12-09 04:03:47
Question: I do a lot of statistical work and use Python as my main language. Some of the data sets I work with, though, can take 20GB of memory, which makes operating on them using in-memory functions in numpy, scipy, and PyIMSL nearly impossible. The statistical analysis language SAS has a big advantage here in that it can operate on data from hard disk as opposed to strictly in-memory processing. But I want to avoid having to write a lot of code in SAS (for a variety of reasons) and am therefore
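A hedged sketch of the out-of-core pattern the question is reaching for: reading an h5py dataset in fixed-size slabs so only one slab is ever in memory (file name, dataset name, and slab size are invented):

    import h5py

    # compute a mean over a dataset far larger than RAM, one slab at a time
    with h5py.File('big_data.h5', 'r') as f:
        dset = f['measurements']
        total, count = 0.0, 0
        step = 1_000_000
        for start in range(0, dset.shape[0], step):
            block = dset[start:start + step]   # only this slab is loaded into memory
            total += block.sum()
            count += block.size
        print('mean =', total / count)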