Question
I am trying to speed up the sum for several big multilevel dataframes.
Here is a sample:
df1 = mul_df(5000,30,400) # mul_df to create a big multilevel dataframe
#let df2, df3, df4 = df1, df1, df1 to minimize the memory usage,
#they can also be mul_df(5000,30,400)
df2, df3, df4 = df1, df1, df1
In [12]: timeit df1+df2+df3+df4
1 loops, best of 3: 993 ms per loop
I am not satisfied with the 993 ms. Is there any way to speed this up? Can cython improve the performance? If yes, how should the cython code be written? Thanks.
Note: mul_df() is the function that creates the demo multilevel dataframe.
import itertools
import numpy as np
import pandas as pd
def mul_df(level1_rownum, level2_rownum, col_num, data_ty='float32'):
    ''' create multilevel dataframe, for example: mul_df(4,2,6)'''
    index_name = ['STK_ID','RPT_Date']
    col_name = ['COL'+str(x).zfill(3) for x in range(col_num)]
    first_level_dt = [['A'+str(x).zfill(4)]*level2_rownum for x in range(level1_rownum)]
    first_level_dt = list(itertools.chain(*first_level_dt)) #flatten the list
    second_level_dt = ['B'+str(x).zfill(3) for x in range(level2_rownum)]*level1_rownum
    dt = pd.DataFrame(np.random.randn(level1_rownum*level2_rownum, col_num), columns=col_name, dtype = data_ty)
    dt[index_name[0]] = first_level_dt
    dt[index_name[1]] = second_level_dt
    rst = dt.set_index(index_name, drop=True, inplace=False)
    return rst
Update:
Timings on my machine: Pentium Dual-Core T4200 @ 2.00 GHz, 3.00 GB RAM, Windows XP, Python 2.7.4, Numpy 1.7.1, Pandas 0.11.0, numexpr 2.0.1 (Anaconda 1.5.0 (32-bit))
In [1]: from pandas.core import expressions as expr
In [2]: import numexpr as ne
In [3]: df1 = mul_df(5000,30,400)
In [4]: df2, df3, df4 = df1, df1, df1
In [5]: expr.set_use_numexpr(False)
In [6]: %timeit df1+df2+df3+df4
1 loops, best of 3: 1.06 s per loop
In [7]: expr.set_use_numexpr(True)
In [8]: %timeit df1+df2+df3+df4
1 loops, best of 3: 986 ms per loop
In [9]: %timeit DataFrame(ne.evaluate('df1+df2+df3+df4'),columns=df1.columns,index=df1.index,dtype='float32')
1 loops, best of 3: 388 ms per loop
Answer 1:
method 1: Plain pandas with numexpr disabled (not so bad on my machine)
In [41]: from pandas.core import expressions as expr
In [42]: expr.set_use_numexpr(False)
In [43]: %timeit df1+df2+df3+df4
1 loops, best of 3: 349 ms per loop
method 2: Using numexpr (which is enabled by default if numexpr is installed)
In [44]: expr.set_use_numexpr(True)
In [45]: %timeit df1+df2+df3+df4
10 loops, best of 3: 173 ms per loop
method 3: Direct use of numexpr
In [34]: import numexpr as ne
In [46]: %timeit DataFrame(ne.evaluate('df1+df2+df3+df4'),columns=df1.columns,index=df1.index,dtype='float32')
10 loops, best of 3: 47.7 ms per loop
These speedups are achieved using numexpr because it:
- avoids intermediate temporary arrays (which in the case you are presenting is probably quite inefficient in numpy; I suspect this is being evaluated like ((df1+df2)+df3)+df4)
- uses multiple cores when available
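To make the first bullet concrete, here is a minimal NumPy-only sketch of what "intermediate temporary arrays" means. It does not reproduce what numexpr does internally; it only contrasts a chained sum, where every + allocates a new full-size array, with accumulation into one output buffer. The function names sum_with_temporaries and sum_in_place and the array sizes are made up for illustration.

```python
import numpy as np

def sum_with_temporaries(arrays):
    """Chained addition: every + allocates a fresh full-size result array."""
    total = arrays[0] + arrays[1]
    for arr in arrays[2:]:
        total = total + arr
    return total

def sum_in_place(arrays):
    """Accumulate into one output buffer, so only a single array is allocated."""
    out = arrays[0].copy()
    for arr in arrays[1:]:
        np.add(out, arr, out=out)  # write the result back into `out`
    return out

# Small illustrative input (sizes are arbitrary, not the ones from the question).
rng = np.random.default_rng(0)
a = rng.standard_normal((1000, 40)).astype('float32')
arrays = [a, a, a, a]

chained = sum_with_temporaries(arrays)
in_place = sum_in_place(arrays)
```

Both functions return the same values; the difference is only in how many full-size arrays get allocated along the way.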
As I hinted above, pandas uses numexpr under the hood for certain types of ops (in 0.11), e.g. df1 + df2 would be evaluated this way. However, the example you are giving here results in several calls to numexpr (this is why method 2 is faster than method 1). Using ne.evaluate(...) directly (method 3) achieves an even bigger speedup.
Note that in pandas 0.13 (0.12 will be released this week), we are implementing a function pd.eval which will in effect do exactly what my example above does. Stay tuned (if you are adventurous, this will be in master somewhat soon: https://github.com/pydata/pandas/pull/4037).
In [5]: %timeit pd.eval('df1+df2+df3+df4')
10 loops, best of 3: 50.9 ms per loop
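For readers on a pandas version that ships pd.eval, a small self-contained sketch of the same call might look like the following. The tiny frame is an assumption for illustration (it is not built with mul_df), and the frames are passed explicitly through local_dict so the example does not depend on the calling scope. No timings are claimed here.

```python
import numpy as np
import pandas as pd

# A tiny two-level frame standing in for mul_df(...); names mirror the question.
index = pd.MultiIndex.from_product(
    [['A0000', 'A0001', 'A0002'], ['B000', 'B001']],
    names=['STK_ID', 'RPT_Date'],
)
columns = ['COL000', 'COL001', 'COL002']
df1 = pd.DataFrame(
    np.random.randn(len(index), len(columns)).astype('float32'),
    index=index,
    columns=columns,
)
df2, df3, df4 = df1, df1, df1

# pd.eval parses the whole expression at once instead of one + at a time.
frames = {'df1': df1, 'df2': df2, 'df3': df3, 'df4': df4}
result = pd.eval('df1 + df2 + df3 + df4', local_dict=frames)

# Reference result from ordinary chained addition.
expected = df1 + df2 + df3 + df4
```

pd.eval uses numexpr as its engine when numexpr is installed and otherwise falls back to plain Python evaluation, so the result is the same either way.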
Lastly, to answer your question: cython will not help here at all; numexpr is quite efficient at this type of problem (that said, there are situations where cython is helpful).
One caveat: in order to use the direct numexpr method, the frames should already be aligned (numexpr operates on the numpy array and doesn't know anything about the indices). They should also be a single dtype.
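The caveat can be turned into an explicit guard. The sketch below is a hypothetical helper (add_frames_direct is not part of pandas or numexpr) that refuses to run unless all frames share the same index, columns and a single dtype, and only then sums the raw arrays. As an assumption for portability, it falls back to plain NumPy addition when numexpr is not installed.

```python
from functools import reduce

import numpy as np
import pandas as pd

try:
    import numexpr as ne
except ImportError:  # numexpr is optional in this sketch
    ne = None

def add_frames_direct(frames):
    """Sum aligned, single-dtype DataFrames on their raw arrays (hypothetical helper)."""
    first = frames[0]
    for other in frames[1:]:
        if not (first.index.equals(other.index)
                and first.columns.equals(other.columns)):
            raise ValueError('frames are not aligned; reindex them first')
    dtypes = {dtype for frame in frames for dtype in frame.dtypes}
    if len(dtypes) != 1:
        raise ValueError('frames must share a single dtype')

    # Name the arrays a0, a1, ... so they can be referenced in the expression.
    arrays = {'a%d' % i: frame.values for i, frame in enumerate(frames)}
    if ne is not None:
        values = ne.evaluate(' + '.join(arrays), local_dict=arrays)
    else:
        values = reduce(np.add, arrays.values())  # plain NumPy fallback
    # The labels are reattached by hand because the arrays carry none.
    return pd.DataFrame(values, index=first.index, columns=first.columns)

index = pd.MultiIndex.from_product(
    [['A0000', 'A0001'], ['B000', 'B001']], names=['STK_ID', 'RPT_Date']
)
df = pd.DataFrame(
    np.ones((4, 3), dtype='float32'),
    index=index,
    columns=['COL000', 'COL001', 'COL002'],
)
total = add_frames_direct([df, df, df, df])
```

Frames that hold the same labels in a different order are rejected rather than silently added position by position, which is the failure mode the caveat warns about.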
Source: https://stackoverflow.com/questions/17390886/how-to-speed-up-pandas-multilevel-dataframe-sum