pytables

Pandas pytable: how to specify min_itemsize of the elements of a MultiIndex

孤人 submitted on 2019-11-28 14:12:54
I am storing a pandas DataFrame as a PyTables table which contains a MultiIndex. The first level of the MultiIndex is a string corresponding to a userID. Now, most of the userIDs are 13 characters long, but some of them are 15 characters long. When I append a record containing a long userID, PyTables raises an error because it is expecting a 13-character field:

    ValueError: Trying to store a string with len [15] in [user] column but
    this column has a limit of [13]!
    Consider using min_itemsize to preset the sizes on these columns

However, I do not know how to set the attribute min_itemsize for
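A minimal sketch of how min_itemsize is usually passed (the frame below is made up, and it assumes a pandas version that accepts a MultiIndex level name such as "user" as a min_itemsize key, just like a data column):

```python
import numpy as np
import pandas as pd

# Made-up frame: the first MultiIndex level "user" holds string IDs of varying length.
idx = pd.MultiIndex.from_tuples(
    [("A123456789012", 1), ("B12345678901234", 2)], names=["user", "seq"]
)
df = pd.DataFrame({"value": np.random.rand(2)}, index=idx)

with pd.HDFStore("users.h5") as store:
    # Reserve room for the longest expected userID when the table node is first created.
    store.append("my_df", df, format="table", min_itemsize={"user": 20})
```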

Unable to save DataFrame to HDF5 (“object header message is too large”)

余生长醉 submitted on 2019-11-28 11:30:33
I have a DataFrame in Pandas:

    In [7]: my_df
    Out[7]:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 34 entries, 0 to 0
    Columns: 2661 entries, airplane to zoo
    dtypes: float64(2659), object(2)

When I try to save this to disk:

    store = pd.HDFStore(p_full_h5)
    store.append('my_df', my_df)

I get:

    File "H5A.c", line 254, in H5Acreate2
      unable to create attribute
    File "H5A.c", line 503, in H5A_create
      unable to create attribute in object header
    File "H5Oattribute.c", line 347, in H5O_attr_create
      unable to create new attribute in header
    File "H5Omessage.c", line 224, in H5O_msg_append_real
      unable to
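One commonly suggested workaround for this HDF5 object-header limit (the frame and chunk size below are only illustrative) is to avoid packing the metadata for all ~2661 columns into a single node, for example by splitting the columns across several sub-keys of the store:

```python
import numpy as np
import pandas as pd

# Illustrative wide frame; the real one has 2661 columns, whose names alone can
# overflow the HDF5 object-header limit when stored under a single node.
my_df = pd.DataFrame(np.random.rand(34, 2661),
                     columns=[f"col_{i}" for i in range(2661)])

with pd.HDFStore("wide.h5") as store:
    chunk = 500
    for i in range(0, my_df.shape[1], chunk):
        # Each group of columns becomes its own node, keeping per-node metadata small.
        store.append(f"my_df/part_{i // chunk}", my_df.iloc[:, i:i + chunk])
```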

Convert large csv to hdf5

天涯浪子 submitted on 2019-11-28 05:52:34
I have a 100M line csv file (actually many separate csv files) totaling 84GB. I need to convert it to an HDF5 file with a single float dataset. I used h5py in testing without any problems, but now I can't create the final dataset without running out of memory. How can I write to HDF5 without having to store the whole dataset in memory? I'm expecting actual code here, because it should be quite simple. I was just looking into pytables, but it doesn't look like the array class (which corresponds to an HDF5 dataset) can be written to iteratively. Similarly, pandas has read_csv and to_hdf methods in
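A minimal sketch of the streaming write being asked about, using h5py's resizable datasets (file names, chunk size, and the float-only column layout are assumptions):

```python
import h5py
import pandas as pd

with h5py.File("out.h5", "w") as f:
    # Start empty and let the dataset grow; chunked storage is required for resizing.
    dset = f.create_dataset("data", shape=(0,), maxshape=(None,),
                            dtype="float64", chunks=True)
    for chunk in pd.read_csv("big.csv", chunksize=1_000_000, header=None):
        values = chunk.to_numpy(dtype="float64").ravel()   # flatten each chunk row-major
        dset.resize(dset.shape[0] + values.size, axis=0)
        dset[-values.size:] = values
```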

Is there an analysis speed or memory usage advantage to using HDF5 for large array storage (instead of flat binary files)?

老子叫甜甜 submitted on 2019-11-28 02:39:29
I am processing large 3D arrays, which I often need to slice in various ways to do a variety of data analysis. A typical "cube" can be ~100GB (and will likely get larger in the future). It seems that the typical recommended file format for large datasets in Python is HDF5 (either h5py or pytables). My question is: is there any speed or memory usage benefit to using HDF5 to store and analyze these cubes over storing them in simple flat binary files? Is HDF5 more appropriate for tabular data, as opposed to large arrays like the ones I am working with? I see that HDF5 can provide nice
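For a concrete feel of the two approaches, here is an illustrative (deliberately tiny) example of the same cube written as a flat binary file and as a chunked HDF5 dataset, then sliced along the last axis; with chunked HDF5 storage such a slice only touches the chunks it needs, while the flat file relies on memmap and the OS page cache:

```python
import numpy as np
import h5py

shape = (64, 64, 64)                      # stand-in for a ~100 GB cube
cube = np.random.rand(*shape)

# Flat binary file, read back through a memory map.
cube.tofile("cube.dat")
slab_mm = np.memmap("cube.dat", dtype="float64", mode="r", shape=shape)[:, :, 0]

# Chunked HDF5 dataset, read back through h5py.
with h5py.File("cube.h5", "w") as f:
    f.create_dataset("cube", data=cube, chunks=(16, 16, 16))
with h5py.File("cube.h5", "r") as f:
    slab_h5 = f["cube"][:, :, 0]
```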

How should I use the h5py lib for storing time series data

我的未来我决定 submitted on 2019-11-28 02:14:37
I have some time series data that I previously stored as HDF5 files using pytables. I recently tried storing the same with the h5py lib. However, since all elements of a numpy array have to be of the same dtype, I have to convert the date (which is usually the index) into 'float64' type before storing it with h5py. When I use pytables, the index and its dtype are preserved, which makes it possible for me to query the time series without pulling it all into memory. I guess with h5py that is not possible. Am I missing something here? And if not, under what situations should I use
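One possible h5py layout for this (dataset names are assumptions): keep the DatetimeIndex as int64 nanoseconds in its own dataset next to the values, so the values keep their native dtype and a slice of the series can be rebuilt without reading the whole file:

```python
import h5py
import numpy as np
import pandas as pd

ts = pd.Series(np.random.rand(1000),
               index=pd.date_range("2019-01-01", periods=1000, freq="min"))

with h5py.File("timeseries.h5", "w") as f:
    f.create_dataset("index", data=ts.index.values.astype("int64"))  # ns since epoch
    f.create_dataset("values", data=ts.to_numpy())

with h5py.File("timeseries.h5", "r") as f:
    # Partial read: only the first 100 points come off disk.
    idx = pd.to_datetime(f["index"][:100], unit="ns")
    part = pd.Series(f["values"][:100], index=idx)
```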

Merging two tables with millions of rows in Python

只谈情不闲聊 submitted on 2019-11-27 19:52:17
I am using Python for some data analysis. I have two tables: the first (let's call it 'A') has 10 million rows and 10 columns and the second ('B') has 73 million rows and 2 columns. They have 1 column with common ids and I want to intersect the two tables based on that column. In particular I want the inner join of the tables. I could not load table B into memory as a pandas dataframe to use the normal merge function in pandas. I tried reading the file of table B in chunks, intersecting each chunk with A and then concatenating these intersections (output from inner joins). This is OK on
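A minimal chunked-merge sketch along the lines described (file names and the key column "id" are assumptions; table A is kept in memory while table B is streamed):

```python
import pandas as pd

a = pd.read_csv("table_a.csv")                      # ~10M rows x 10 cols, fits in RAM

pieces = []
for chunk in pd.read_csv("table_b.csv", chunksize=5_000_000):
    # Inner join of each chunk of B against A; only matching rows are kept.
    pieces.append(a.merge(chunk, on="id", how="inner"))

result = pd.concat(pieces, ignore_index=True)
```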

How to get faster code than numpy.dot for matrix multiplication?

时间秒杀一切 submitted on 2019-11-27 12:28:15
Here, in Matrix multiplication using hdf5, I use hdf5 (pytables) for big matrix multiplication, but I was surprised because using hdf5 it works even faster than using plain numpy.dot and storing matrices in RAM. What is the reason for this behavior? And maybe there is some faster function for matrix multiplication in Python, because I still use numpy.dot for small-block matrix multiplication. Here is some code. Assume the matrices can fit in RAM: test on a matrix of 10*1000 x 1000. Using default numpy (I think with no BLAS lib). Plain numpy arrays in RAM: time 9.48. If A, B in RAM, C on disk: time 1.48. If A, B, C on
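For reference, a rough sketch of the kind of blocked, on-disk multiply being described (shapes, block size, and the assumption that B fits in RAM are illustrative):

```python
import numpy as np
import tables

n, m = 10_000, 1_000
block = 1_000

with tables.open_file("matmul.h5", "w") as f:
    A = f.create_carray(f.root, "A", tables.Float64Atom(), shape=(n, m))
    B = f.create_carray(f.root, "B", tables.Float64Atom(), shape=(m, m))
    C = f.create_carray(f.root, "C", tables.Float64Atom(), shape=(n, m))

    A[:] = np.random.rand(n, m)
    B[:] = np.random.rand(m, m)

    Bmem = B[:]                                   # B is read into RAM once
    for i in range(0, n, block):
        # Only one block-row of A and C is in memory at a time.
        C[i:i + block, :] = A[i:i + block, :] @ Bmem
```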

Python: how to store a numpy multidimensional array in PyTables?

半腔热情 submitted on 2019-11-27 11:50:25
How can I put a numpy multidimensional array in an HDF5 file using PyTables? From what I can tell, I can't put an array field in a pytables table. I also need to store some info about this array and be able to do mathematical computations on it. Any suggestions?

Joe Kington: There may be a simpler way, but this is how you'd go about doing it, as far as I know:

    import numpy as np
    import tables

    # Generate some data
    x = np.random.random((100,100,100))

    # Store "x" in a chunked array...
    f = tables.open_file('test.hdf', 'w')
    atom = tables.Atom.from_dtype(x.dtype)
    ds = f.createCArray(f.root, 'somename',
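The quoted answer is truncated; a hedged completion of the same approach (node name and attribute are placeholders, and the modern create_carray spelling is used) would look roughly like this:

```python
import numpy as np
import tables

# Generate some data
x = np.random.random((100, 100, 100))

with tables.open_file("test.hdf", "w") as f:
    # Store "x" in a chunked array; 'somename' is just a placeholder node name.
    atom = tables.Atom.from_dtype(x.dtype)
    ds = f.create_carray(f.root, "somename", atom, x.shape)
    ds[:] = x
    # Extra info about the array can be kept as HDF5 attributes on the node.
    ds.attrs.description = "random test cube"

with tables.open_file("test.hdf", "r") as f:
    # Read back only a slab and compute on it.
    slab_mean = f.root.somename[0, :, :].mean()
```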

Improve pandas (PyTables?) HDF5 table write performance

ぃ、小莉子 submitted on 2019-11-27 09:29:24
I've been using pandas for research now for about two months to great effect. With large numbers of medium-sized trace event datasets, pandas + PyTables (the HDF5 interface) does a tremendous job of allowing me to process heterogeneous data using all the Python tools I know and love. Generally speaking, I use the Fixed (formerly "Storer") format in PyTables, as my workflow is write-once, read-many, and many of my datasets are sized such that I can load 50-100 of them into memory at a time with
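For context, a small illustrative comparison of the two write paths (sizes are made up): the Fixed format is the quicker one to write for a write-once, read-many workflow, while the Table format pays indexing cost in exchange for appends and queries:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1_000_000, 10),
                  columns=[f"c{i}" for i in range(10)])

with pd.HDFStore("traces.h5") as store:
    store.put("fixed_copy", df, format="fixed")     # fast write, no querying
    store.append("table_copy", df, format="table",  # queryable, slower to write
                 index=False)                       # skipping the PyTables index speeds writes
```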

Issue with insert using psycopg

不打扰是莪最后的温柔 submitted on 2019-11-27 08:46:57
I am reading data from a .mat file using the PyTables module. After reading the data, I want to insert it into the database using psycopg. Here is a sample code piece:

    file = tables.openFile(matFile)
    x = 0

    # populate the matData list
    for var in dest:
        data = file.getNode('/' + var)[:]
        matData.append(data)
        x = x + 1

    # insert into db
    for i in range(0, x):
        cur.execute("""INSERT INTO \"%s\" (%s) VALUES (%s)""" % tableName,dest[i],matData[i]) )

I am getting the following error:

    Traceback (most
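The traceback is cut off, but a hedged guess at a fix (assuming the failure stems from the SQL string construction, which lacks a tuple around the '%' arguments and has an unbalanced ')', and assuming each dest[i] names a single array-typed column) might look like:

```python
from psycopg2.extensions import AsIs

# Identifiers are passed through AsIs only because they come from a trusted
# .mat file; the row values are handed to psycopg2 as a plain Python list so
# the driver can adapt them itself (to a Postgres ARRAY here).
for var, data in zip(dest, matData):
    cur.execute(
        "INSERT INTO %s (%s) VALUES (%s)",
        (AsIs('"%s"' % tableName), AsIs(var), [float(v) for v in data]),
    )
```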