From the question: "I've been using pandas for research now for about two months to great effect. With large numbers of medium-sized trace event datasets, pandas + PyTables (the HDF5 interface) ..."
Here is a similar comparison I just did. It's about 1/3 of the data (10M rows), and the final file size is about 1.3GB.
I define 3 timing functions:
Test the Fixed format (called Storer in 0.12). This writes in a PyTables Array format
import pandas as pd

def f(df):
    store = pd.HDFStore('test.h5', 'w')
    store['df'] = df
    store.close()
Write in the Table format, which uses the PyTables Table format, but do not create an index.
def f2(df):
    store = pd.HDFStore('test.h5', 'w')
    store.append('df', df, index=False)
    store.close()
Same as f2, but create an index (which is normally done)
def f3(df):
    store = pd.HDFStore('test.h5', 'w')
    store.append('df', df)
    store.close()
Create the frame
In [25]: df = concat([DataFrame(np.random.randn(10000000,10)),DataFrame(np.random.randint(0,10,size=50000000).reshape(10000000,5))],axis=1)
In [26]: df
Out[26]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Columns: 15 entries, 0 to 4
dtypes: float64(10), int64(5)
v0.12.0
In [27]: %timeit f(df)
1 loops, best of 3: 14.7 s per loop
In [28]: %timeit f2(df)
1 loops, best of 3: 32 s per loop
In [29]: %timeit f3(df)
1 loops, best of 3: 40.1 s per loop
master/v0.13.0
In [5]: %timeit f(df)
1 loops, best of 3: 12.9 s per loop
In [6]: %timeit f2(df)
1 loops, best of 3: 17.5 s per loop
In [7]: %timeit f3(df)
1 loops, best of 3: 24.3 s per loop
Timing Runs with the same file as provided by the OP (link is below)
In [4]: df = pd.read_hdf('test.h5','df')
In [5]: df
Out[5]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 28880943 entries, 0 to 28880942
Columns: 14 entries, node_id to kernel_type
dtypes: float64(4), int64(10)
Like f, Fixed format
In [6]: %timeit df.to_hdf('test.hdf','df',mode='w')
1 loops, best of 3: 36.2 s per loop
Like f2, Table format, no index
In [7]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',index=False)
1 loops, best of 3: 45 s per loop
In [8]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',index=False,chunksize=2000000)
1 loops, best of 3: 44.5 s per loop
Like f3, Table format with index
In [9]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',chunksize=2000000)
1 loops, best of 3: 1min 36s per loop
Like f3, Table format with index, compressed with blosc
In [10]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',chunksize=2000000,complib='blosc')
1 loops, best of 3: 46.5 s per loop
In [11]: %timeit pd.read_hdf('test.hdf','df')
1 loops, best of 3: 10.8 s per loop
Show the original file (test.h5) and the compressed one (test.hdf)
In [13]: !ls -ltr test.h*
-rw-r--r-- 1 jreback users 3471518282 Nov 20 18:20 test.h5
-rw-rw-r-- 1 jreback users 649327780 Nov 20 21:17 test.hdf
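A quick ratio from that listing: the blosc-compressed Table (test.hdf) is roughly 5x smaller on disk than the original test.h5 (3471518282 / 649327780 ≈ 5.3).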
Several points to note.
Not creating an index can make a non-trivial difference in write time. I also believe that if you have a string-based index it can substantially worsen write time. That said, you always want to create an index to make retrieval very fast.
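If you do write with index=False, you can still build the index later. A minimal sketch, assuming HDFStore.create_table_index (the optlevel/kind values are just my choices):

store = pd.HDFStore('test.h5', 'w')
store.append('df', df, index=False)                        # fast write, no PyTables index
store.create_table_index('df', optlevel=9, kind='full')    # build the index afterwards
store.close()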
You didn't include what your index is, nor whether it's sorted (though I think this only makes a small difference).
The write penalty in my examples is roughly 2x (though I have seen it be somewhat bigger when INCLUDING the index time). Thus your 7s (half of my time), for 3x the number of rows I am writing, is quite suspect. I am using a reasonably fast disk array. If you were using a flash-based disk, then this is possible, though.
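For a rough sanity check on throughput, using the file sizes shown elsewhere in this answer (approximate, uncompressed on-disk sizes):

~1.3 GB / 17.5 s ≈ 74 MB/s (f2, Table without index, on master)
~3.47 GB / 45 s ≈ 77 MB/s (Table without index, on the OP's 28.8M-row frame)
~3.47 GB / 7 s ≈ 500 MB/s, which is what a 7s write of that frame would imply; only a fast SSD sustains that.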
master/v0.13.0 (releasing very soon) improves the write times on tables substantially.
You can try setting the chunksize parameter to a bigger number when you write the data (its default is 100000). The purpose of the relatively low default is to keep memory usage constant (a bigger chunksize uses more memory, but in theory should write faster).
Tables offer 2 advantages over the Fixed format: 1) query retrieval, and 2) appendability. Reading the entire table doesn't take advantage of either, so if you ONLY want to read the entire table, then the Fixed format is recommended. (In my experience the flexibility of Tables greatly outweighs the write penalty, but YMMV.)
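As a rough sketch of those two advantages (the where-string below uses the 0.13-style query syntax, and new_chunk stands in for a hypothetical DataFrame with the same columns/dtypes):

store = pd.HDFStore('test.h5')

# 1) query retrieval: pull back only the rows you need, not the whole table
subset = store.select('df', where='index > 1000000 & index < 2000000')

# 2) appendability: grow the same table with another chunk of rows
store.append('df', new_chunk)

store.close()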
Bottom line is to repeat the timings (use IPython's %timeit, as it runs multiple trials). If you can reproduce your results, then please post a %prun and I'll take a look.
Update:
So the recommended way for a table this size is to compress with blosc and use pandas master/0.13.0 along with PyTables 3.0.0.
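For instance, something along these lines (the complevel here is my choice, not one of the timed runs above):

df.to_hdf('test.hdf', 'df', mode='w', format='table',
          complib='blosc', complevel=9, chunksize=2000000)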
That's an interesting discussion. I think Peter is getting awesome performance for the Fixed format because the format writes in a single shot and also because he has a really good SSD (it can write at more than 450 MB/s).
Appending to a table is a more complex operation (the dataset has to be enlarged, and new records must be checked so that we can ensure that they follow the schema of the table). This is why appending rows to tables is generally slower (but still, Jeff is getting ~70 MB/s, which is pretty good). That Jeff is getting more speed than Peter is probably due to the fact that he has a better processor.
Finally, indexing in PyTables uses a single processor, yes, and that normally is an expensive operation, so you should really disable it if you are not going to query data on-disk.