hdf5

Python (pandas): store a data frame in hdf5 with a multi index

Submitted by ∥☆過路亽.° on 2019-12-09 13:05:12
Question: I need to work with a large, multi-indexed data frame, so I tried to create a data frame to learn how to store it in an HDF5 file. The data frame looks like this (with the multi index in the first two columns):

                         0
    Symbol Date
    C      2014-07-21  4792
    B      2014-07-21  4492
    A      2014-07-21  5681
    B      2014-07-21  8310
    A      2014-07-21  1197
    C      2014-07-21  4722
           2014-07-21  7695
           2014-07-21  1774

I'm using pandas.to_hdf, but it creates a "Fixed format store"; when I try to select the data in a group: store.select(
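A minimal sketch of the usual fix, assuming pandas with PyTables installed (file name, key, and column names here are made up): writing with format='table' instead of the default fixed format produces a queryable store that supports store.select.

    import numpy as np
    import pandas as pd

    # Hypothetical frame with a two-level (Symbol, Date) index
    idx = pd.MultiIndex.from_product(
        [["A", "B", "C"], pd.to_datetime(["2014-07-21", "2014-07-22"])],
        names=["Symbol", "Date"])
    df = pd.DataFrame({"value": np.random.randint(1000, 9000, len(idx))}, index=idx)

    # format='table' writes a queryable PyTables store instead of a fixed one
    df.to_hdf("frame.h5", key="df", mode="w", format="table")

    with pd.HDFStore("frame.h5") as store:
        subset = store.select("df", where="Symbol == 'A'")  # query on an index level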

PyTables read random subset

Submitted by 假装没事ソ on 2019-12-09 11:20:37
Question: Is it possible to read a random subset of rows from HDF5 (via PyTables or, preferably, pandas)? I have a very large dataset with millions of rows, but only need a sample of a few thousand for analysis. And what about reading from a compressed HDF file?

Answer 1: Use HDFStore (docs are here, compression docs are here). Random access via a constructed index is supported in 0.13:

    In [26]: df = DataFrame(np.random.randn(100,2),columns=['A','B'])
    In [27]: df.to_hdf('test.h5','df',mode='w',format='table')
    In
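A sketch of the "constructed index" trick the answer refers to, assuming the table-format store created above and pandas >= 0.13 (sample size and file name are illustrative):

    import numpy as np
    import pandas as pd

    with pd.HDFStore('test.h5') as store:
        nrows = store.get_storer('df').nrows          # row count without loading the data
        rows = np.random.randint(0, nrows, size=10)   # random row positions

    # pass the positions as a constructed index; only those rows are read from disk
    sample = pd.read_hdf('test.h5', 'df', where=pd.Index(rows))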

Storing scipy sparse matrix as HDF5

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-09 10:55:12
Question: I want to compress and store a humongous SciPy matrix in HDF5 format. How do I do this? I've tried the code below:

    a = csr_matrix((dat, (row, col)), shape=(947969, 36039))
    f = h5py.File('foo.h5','w')
    dset = f.create_dataset("init", data=a, dtype=int, compression='gzip')

I get errors like these:

    TypeError: Scalar datasets don't support chunk/filter options
    IOError: Can't prepare for writing data (No appropriate function for conversion path)

I can't convert it to a numpy array as there will be
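The excerpted answer is cut off, but one common workaround (not necessarily the one the answer gives) is to store the CSR components as separate h5py datasets; a hedged sketch:

    import h5py
    from scipy.sparse import csr_matrix

    def save_csr(h5file, name, m):
        # store the three CSR component arrays plus the shape, each gzip-compressed
        g = h5file.create_group(name)
        g.create_dataset('data', data=m.data, compression='gzip')
        g.create_dataset('indices', data=m.indices, compression='gzip')
        g.create_dataset('indptr', data=m.indptr, compression='gzip')
        g.attrs['shape'] = m.shape

    def load_csr(h5file, name):
        g = h5file[name]
        return csr_matrix((g['data'][:], g['indices'][:], g['indptr'][:]),
                          shape=tuple(g.attrs['shape']))

    with h5py.File('foo.h5', 'w') as f:
        save_csr(f, 'init', a)   # 'a' is the csr_matrix from the question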

Reading and writing reference types using hdf5.net

Submitted by 心不动则不痛 on 2019-12-09 09:41:46
Question: I'm using HDF5DotNet to write a generic data logging API, DataLog<T>. The idea is to use reflection to automatically create an H5 compound data type which contains the fields in T. The user can then easily add data to the data log using a write(T[] data) method. In order to automatically create the H5 types, the class or structure must be decorated with [StructLayoutAttribute] and some fields with [MarshalAsAttribute]. Each field is then mapped to an H5 type and added to the H5 compound data

HDFStore.append(string, DataFrame) fails when string column contents are longer than those already there

Submitted by 。_饼干妹妹 on 2019-12-09 09:13:25
Question: I have a pandas DataFrame stored via an HDFStore that essentially holds summary rows about test runs I am doing. Several of the fields in each row contain descriptive strings of variable length. When I do a test run, I create a new DataFrame with a single row in it:

    def export_as_df(self):
        return pd.DataFrame(data=[self._to_dict()], index=[datetime.datetime.now()])

and then call HDFStore.append(string, DataFrame) to add the new row to the existing DataFrame. This works fine, apart from where
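One common remedy (the excerpt is truncated, so this may differ from the accepted answer) is to reserve column widths up front with min_itemsize when appending; a sketch with hypothetical file, key, and column names:

    import pandas as pd

    with pd.HDFStore('runs.h5') as store:
        # reserve enough bytes for the widest strings these columns will ever hold,
        # so later appends with longer values do not fail
        store.append('results', new_row_df,
                     min_itemsize={'description': 200, 'notes': 500})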

Sparse array support in HDF5

Submitted by 一笑奈何 on 2019-12-09 09:07:07
Question: I need to store a 512^3 array on disk in some way, and I'm currently using HDF5. Since the array is sparse, a lot of disk space gets wasted. Does HDF5 provide any support for sparse arrays?

Answer 1: Chunked datasets (H5D_CHUNKED) allow sparse storage, but depending on your data, the overhead may be significant. Take a typical array and try both sparse and non-sparse, then compare the file sizes; you will see whether it is really worth it.

Answer 2: One workaround is to create the dataset with a compression
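A minimal h5py sketch of the chunked-storage approach from Answer 1 (shape, chunk size, and the written region are made up): chunks that are never written take essentially no space on disk, and compression shrinks the rest.

    import numpy as np
    import h5py

    with h5py.File('sparse.h5', 'w') as f:
        dset = f.create_dataset('field', shape=(512, 512, 512), dtype='f4',
                                chunks=(64, 64, 64), compression='gzip')
        # write only the small non-empty region; untouched chunks stay unallocated
        dset[100:110, 200:210, 300:310] = np.random.rand(10, 10, 10)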

Discovering keys using h5py in python3

Submitted by 本秂侑毒 on 2019-12-09 07:59:21
Question: In python2.7, I can inspect an HDF5 file's keys using

    $ python
    >>> import h5py
    >>> f = h5py.File('example.h5', 'r')
    >>> f.keys()
    [u'some_key']

However, in python3.4, I get something different:

    $ python3 -q
    >>> import h5py
    >>> f = h5py.File('example.h5', 'r')
    >>> f.keys()
    KeysViewWithLock(<HDF5 file "example.h5" (mode r)>)

What is KeysViewWithLock, and how can I examine my HDF5 keys in Python3?

Answer 1: From h5py's website (http://docs.h5py.org/en/latest/high/group.html#dict-interface-and-links):
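The answer excerpt is cut off; for reference, a short sketch of the standard Python 3 way to materialize the view:

    import h5py

    with h5py.File('example.h5', 'r') as f:
        print(list(f.keys()))      # turn the KeysViewWithLock into a plain list
        for name in f:             # or simply iterate over the group
            print(name, f[name])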

What is a better approach of storing and querying a big dataset of meteorological data

Submitted by 烂漫一生 on 2019-12-09 06:47:43
Question: I am looking for a convenient way to store and query a huge amount of meteorological data (a few TB). More information about the type of data is in the middle of the question. Previously I was leaning toward MongoDB (I have used it for many of my own previous projects and feel comfortable dealing with it), but recently I found out about the HDF5 data format. Reading about it, I found some similarities with Mongo: HDF5 simplifies the file structure to include only two major types of
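For context, a tiny sketch of the two HDF5 object types the excerpt starts to describe — groups (directory-like) and datasets (array-like); the file layout and names are entirely hypothetical:

    import numpy as np
    import h5py

    with h5py.File('weather.h5', 'w') as f:
        station = f.create_group('stations/BERLIN')           # group: acts like a folder
        station.create_dataset('temperature',                 # dataset: holds the array
                               data=np.random.rand(24 * 365),
                               compression='gzip')
        station.attrs['latitude'] = 52.52                     # metadata as attributes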

write a boost::multi_array to hdf5 dataset

Submitted by 眉间皱痕 on 2019-12-09 04:46:41
Question: Are there any libraries or headers available to make writing C++ vectors or boost::multi_arrays to HDF5 datasets easy? I have looked at the HDF5 C++ examples, and they just use C++ syntax to call C functions, and they only write static C arrays to their datasets (see create.cpp). Am I missing the point!? Many thanks in advance, Adam

Answer 1: Here is how to write N-dimensional multi_arrays in HDF5 format. Here is a short example:

    #include <boost/multi_array.hpp>
    using boost::multi_array;
    using boost:

Experience with using h5py to do analytical work on big data in Python?

Submitted by 风格不统一 on 2019-12-09 04:03:47
Question: I do a lot of statistical work and use Python as my main language. Some of the data sets I work with, though, can take 20GB of memory, which makes operating on them using in-memory functions in numpy, scipy, and PyIMSL nearly impossible. The statistical analysis language SAS has a big advantage here in that it can operate on data from hard disk as opposed to strictly in-memory processing. But I want to avoid having to write a lot of code in SAS (for a variety of reasons) and am therefore
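A hedged sketch of the out-of-core pattern the question is reaching for: reading an h5py dataset in fixed-size slabs so only one slab is ever in memory (file name, dataset name, and slab size are invented):

    import h5py

    # compute a mean over a dataset far larger than RAM, one slab at a time
    with h5py.File('big_data.h5', 'r') as f:
        dset = f['measurements']
        total, count = 0.0, 0
        step = 1_000_000
        for start in range(0, dset.shape[0], step):
            block = dset[start:start + step]   # only this slab is loaded into memory
            total += block.sum()
            count += block.size
        print('mean =', total / count)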