pytables

Pandas pytable: how to specify min_itemsize of the elements of a MultiIndex

Submitted by 家住魔仙堡 on 2019-11-27 08:15:13
Question: I am storing a pandas DataFrame as a PyTables table which contains a MultiIndex. The first level of the MultiIndex is a string corresponding to a userID. Most of the userIDs are 13 characters long, but some of them are 15 characters long. When I append a record containing one of the long userIDs, PyTables raises an error because it is expecting a 13-character field: ValueError: Trying to store a string with len [15] in [user] column but this column has a limit of [13]! Consider using min_itemsize to …
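A minimal sketch of the usual fix, assuming the offending MultiIndex level is named 'user' (as in the error message), that the store is written in table format, and that pandas accepts the level name as a min_itemsize key. The frame, key, and ID values below are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: first MultiIndex level 'user' holds IDs of mixed length.
df = pd.DataFrame(
    {"value": [1.0, 2.0]},
    index=pd.MultiIndex.from_tuples(
        [("AAAAAAAAAAAAA", 1), ("BBBBBBBBBBBBBBB", 2)],  # 13-char and 15-char IDs
        names=["user", "seq"],
    ),
)

with pd.HDFStore("users.h5", mode="w") as store:
    # Reserve 15 bytes for the 'user' level so longer IDs fit on later appends.
    store.append("df", df.iloc[:1], format="table", min_itemsize={"user": 15})
    store.append("df", df.iloc[1:])  # the 15-character ID now appends without the ValueError
```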

Pandas “Group By” Query on Large Data in HDFStore?

Submitted by 邮差的信 on 2019-11-27 07:07:31
I have about 7 million rows in an HDFStore with more than 60 columns. The data is more than I can fit into memory. I'm looking to aggregate the data into groups based on the value of a column "A". The pandas documentation for splitting/aggregating/combining assumes that I already have all my data in a DataFrame, but I can't read the entire store into an in-memory DataFrame. What is the correct approach for grouping data in an HDFStore? Answer: Here's a complete example:

import numpy as np
import pandas as pd
import os

fname = 'groupby.h5'

# create a frame
df = pd.DataFrame({'A': ['foo', 'foo', …
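One hedged sketch of the out-of-core pattern the answer works toward: iterate over the stored table in chunks with select(chunksize=...), aggregate each chunk, then combine the partial results. The key name 'df', the numeric column 'B', and the chunk size are assumptions.

```python
import pandas as pd

partials = []
with pd.HDFStore("groupby.h5") as store:
    # Stream the table in manageable pieces instead of loading it whole.
    for chunk in store.select("df", columns=["A", "B"], chunksize=500_000):
        partials.append(chunk.groupby("A")["B"].sum())

# Per-chunk sums combine into exact per-group totals (sums are associative).
result = pd.concat(partials).groupby(level=0).sum()
print(result)
```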

Convert large csv to hdf5

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-11-27 05:29:07
Question: I have a 100M-line CSV file (actually many separate CSV files) totaling 84 GB. I need to convert it to an HDF5 file with a single float dataset. I used h5py in testing without any problems, but now I can't build the final dataset without running out of memory. How can I write to HDF5 without having to hold the whole dataset in memory? I'm expecting actual code here, because it should be quite simple. I was just looking into pytables, but it doesn't look like the array class (which corresponds to …
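A sketch of the streaming approach with h5py's resizable datasets: read the CSV in chunks through pandas and grow the dataset as rows arrive. The file names, column count, dtype, and chunk sizes are all assumptions.

```python
import h5py
import pandas as pd

csv_path = "big.csv"        # assumed input file
h5_path = "big.h5"          # assumed output file
ncols = 10                  # assumed number of float columns
rows_per_chunk = 1_000_000  # rows read from the CSV at a time

with h5py.File(h5_path, "w") as f:
    # Resizable dataset: start with zero rows, grow along axis 0 as chunks arrive.
    dset = f.create_dataset(
        "data", shape=(0, ncols), maxshape=(None, ncols),
        dtype="float32", chunks=(65536, ncols),
    )
    for chunk in pd.read_csv(csv_path, header=None, chunksize=rows_per_chunk):
        values = chunk.to_numpy(dtype="float32")
        start = dset.shape[0]
        dset.resize(start + values.shape[0], axis=0)
        dset[start:] = values
```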

Storing numpy sparse matrix in HDF5 (PyTables)

Submitted by 你离开我真会死。 on 2019-11-27 03:41:34
I am having trouble storing a numpy csr_matrix with PyTables. I'm getting this error:

TypeError: objects of type ``csr_matrix`` are not supported in this context, sorry; supported objects are: NumPy array, record or scalar; homogeneous list or tuple, integer, float, complex or string

My code:

f = tables.openFile(path, 'w')
atom = tables.Atom.from_dtype(self.count_vector.dtype)
ds = f.createCArray(f.root, 'count', atom, self.count_vector.shape)
ds[:] = self.count_vector
f.close()

Any ideas? Thanks. Answer: A CSR matrix can be fully reconstructed from its data, indices and indptr attributes. These are …
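Following the answer's idea, a sketch that stores the data, indices, and indptr component arrays (plus the shape) and rebuilds the csr_matrix on load; it uses the modern snake_case PyTables API (open_file/create_carray) rather than the deprecated openFile/createCArray from the question.

```python
import numpy as np
import tables
from scipy.sparse import csr_matrix

def save_csr(path, m):
    """Store the three CSR component arrays plus the matrix shape."""
    with tables.open_file(path, "w") as f:
        for name in ("data", "indices", "indptr"):
            arr = np.asarray(getattr(m, name))
            atom = tables.Atom.from_dtype(arr.dtype)
            ds = f.create_carray(f.root, name, atom, arr.shape)
            ds[:] = arr
        f.create_array(f.root, "shape", np.asarray(m.shape))

def load_csr(path):
    """Rebuild the csr_matrix from the stored components."""
    with tables.open_file(path, "r") as f:
        return csr_matrix(
            (f.root.data[:], f.root.indices[:], f.root.indptr[:]),
            shape=tuple(f.root.shape[:]),
        )
```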

Unable to save DataFrame to HDF5 (“object header message is too large”)

Submitted by 谁说我不能喝 on 2019-11-27 03:32:30
Question: I have a DataFrame in Pandas:

In [7]: my_df
Out[7]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 34 entries, 0 to 0
Columns: 2661 entries, airplane to zoo
dtypes: float64(2659), object(2)

When I try to save this to disk:

store = pd.HDFStore(p_full_h5)
store.append('my_df', my_df)

I get:

File "H5A.c", line 254, in H5Acreate2
  unable to create attribute
File "H5A.c", line 503, in H5A_create
  unable to create attribute in object header
File "H5Oattribute.c", line 347, in H5O_attr_create …
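The traceback points at HDF5's 64 KB cap on object-header metadata, which pandas can hit when it records thousands of column names on a single node. One workaround, sketched below with assumed key names and a hypothetical cols_per_node split, is to write the frame as several narrower nodes and reassemble them on read.

```python
import pandas as pd

def save_wide(path, df, key, cols_per_node=500):
    """Hypothetical helper: write a very wide frame as several narrower nodes."""
    with pd.HDFStore(path) as store:
        for i, start in enumerate(range(0, df.shape[1], cols_per_node)):
            store.put(f"{key}/part{i}", df.iloc[:, start:start + cols_per_node])

def load_wide(path, key):
    """Read the parts back and glue the columns together in order."""
    with pd.HDFStore(path) as store:
        part_keys = [k for k in store.keys() if k.startswith(f"/{key}/part")]
        part_keys.sort(key=lambda k: int(k.rsplit("part", 1)[1]))
        parts = [store[k] for k in part_keys]
    return pd.concat(parts, axis=1)
```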

Merging two tables with millions of rows in Python

Submitted by 时间秒杀一切 on 2019-11-26 20:00:45
Question: I am using Python for some data analysis. I have two tables: the first (call it 'A') has 10 million rows and 10 columns, and the second ('B') has 73 million rows and 2 columns. They share one column of common ids, and I want to intersect the two tables on that column; in particular I want the inner join of the tables. I could not load table B into memory as a pandas DataFrame to use the normal merge function in pandas. I tried reading the file of table B in chunks, intersecting …
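A sketch of a chunked inner join, assuming table A (10M x 10) fits in memory on its own, B is streamed in pieces, and both share an 'id' column; the file names, chunk size, and output key are made up.

```python
import pandas as pd

a = pd.read_csv("table_a.csv")                     # assumed to fit in memory
with pd.HDFStore("joined.h5", mode="w") as out:
    for b_chunk in pd.read_csv("table_b.csv", chunksize=5_000_000):
        # Inner-join each slice of B against A on the shared id column,
        # appending matches to disk so the full result never sits in RAM.
        merged = a.merge(b_chunk, on="id", how="inner")
        if not merged.empty:
            out.append("ab", merged, format="table", index=False)
```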

HDF5 taking more space than CSV?

Submitted by 我们两清 on 2019-11-26 19:22:49
Question: Consider the following example. Prepare the data:

import string
import random
import numpy as np
import pandas as pd

matrix = np.random.random((100, 3000))
my_cols = [random.choice(string.ascii_uppercase) for x in range(matrix.shape[1])]
mydf = pd.DataFrame(matrix, columns=my_cols)
mydf['something'] = 'hello_world'

Set the highest compression possible for HDF5:

store = pd.HDFStore('myfile.h5', complevel=9, complib='bzip2')
store['mydf'] = mydf
store.close()

Save also to CSV:

mydf.to_csv('myfile.csv', sep=':')
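A small harness, built from the question's own setup, that writes the same frame with and without compression and prints the resulting file sizes. The exact numbers will vary with the random data; the point is simply to see where the bytes go (8-byte float64 values plus the repeated string column on the HDF5 side versus the CSV's text representation of the same values).

```python
import os
import random
import string

import numpy as np
import pandas as pd

# Rebuild the question's frame: 100 x 3000 floats plus one repeated string column.
matrix = np.random.random((100, 3000))
my_cols = [random.choice(string.ascii_uppercase) for _ in range(matrix.shape[1])]
mydf = pd.DataFrame(matrix, columns=my_cols)
mydf["something"] = "hello_world"

# Same data written three ways; compare the resulting file sizes.
with pd.HDFStore("plain.h5", mode="w") as store:
    store["mydf"] = mydf                      # fixed format, no compression
with pd.HDFStore("compressed.h5", mode="w", complevel=9, complib="bzip2") as store:
    store["mydf"] = mydf                      # fixed format, bzip2 filter
mydf.to_csv("myfile.csv", sep=":")

for path in ("plain.h5", "compressed.h5", "myfile.csv"):
    print(f"{path}: {os.path.getsize(path):,} bytes")
```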

Is there an analysis speed or memory usage advantage to using HDF5 for large array storage (instead of flat binary files)?

Submitted by 旧城冷巷雨未停 on 2019-11-26 18:43:35
Question: I am processing large 3D arrays, which I often need to slice in various ways for different kinds of data analysis. A typical "cube" can be ~100 GB (and will likely get larger in the future). It seems that the typical recommended file format for large datasets in Python is HDF5 (either h5py or pytables). My question is: is there any speed or memory-usage benefit to using HDF5 to store and analyze these cubes over storing them in simple flat binary files? Is HDF5 more appropriate for tabular …
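A rough benchmarking sketch, with the array scaled down so it runs anywhere: the same 3D cube written as a chunked h5py dataset and as a flat binary file read through np.memmap, then sliced along an axis that is unfavorable for a contiguous layout. Chunking is the main lever HDF5 offers here; whether it beats a memmap depends entirely on the access pattern, so treat this as a template for measuring rather than a verdict.

```python
import time

import h5py
import numpy as np

shape = (64, 512, 512)                      # scaled-down stand-in for a ~100 GB cube
cube = np.random.random(shape).astype("float32")

# HDF5 with explicit chunking, vs. a flat binary file mapped with np.memmap.
with h5py.File("cube.h5", "w") as f:
    f.create_dataset("cube", data=cube, chunks=(8, 64, 64))
cube.tofile("cube.raw")

def time_slice(label, fn):
    t0 = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - t0:.4f}s")

with h5py.File("cube.h5", "r") as f:
    time_slice("hdf5 last-axis slice", lambda: f["cube"][:, :, 100])
mm = np.memmap("cube.raw", dtype="float32", mode="r", shape=shape)
time_slice("memmap last-axis slice", lambda: np.array(mm[:, :, 100]))
```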
