I am trying to speed up the sum for several big multilevel dataframes.
Here is a sample:
df1 = mul_df(5000,30,400) # mul_df creates a big multilevel dataframe
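The question doesn't show `mul_df` itself, so here is a hypothetical sketch of such a helper (the name, signature, and index labels are my assumptions): it builds a DataFrame with a two-level row MultiIndex and a float32 payload, which matches the shapes and dtype used in the timings below.

```python
import numpy as np
import pandas as pd

def mul_df(level1_rows, level2_rows, num_cols, dtype='float32'):
    """Hypothetical helper: build a DataFrame with a two-level row
    MultiIndex of shape (level1_rows * level2_rows, num_cols)."""
    index = pd.MultiIndex.from_product(
        [range(level1_rows), range(level2_rows)],
        names=['level1', 'level2'])
    data = np.random.rand(level1_rows * level2_rows, num_cols).astype(dtype)
    return pd.DataFrame(data, index=index,
                        columns=['col%d' % i for i in range(num_cols)])
```

With `mul_df(5000, 30, 400)` this gives a 150,000 x 400 float32 frame; `df2`, `df3`, and `df4` would be built the same way.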
Method 1: plain pandas with numexpr disabled (not so bad on my machine)
In [41]: from pandas.core import expressions as expr
In [42]: expr.set_use_numexpr(False)
In [43]: %timeit df1+df2+df3+df4
1 loops, best of 3: 349 ms per loop
Method 2: via numexpr (enabled by default when numexpr is installed)
In [44]: expr.set_use_numexpr(True)
In [45]: %timeit df1+df2+df3+df4
10 loops, best of 3: 173 ms per loop
Method 3: direct use of numexpr
In [34]: import numexpr as ne
In [46]: %timeit DataFrame(ne.evaluate('df1+df2+df3+df4'),columns=df1.columns,index=df1.index,dtype='float32')
10 loops, best of 3: 47.7 ms per loop
These speedups are possible because numexpr evaluates the whole expression in one pass, whereas pandas evaluates it pairwise as ((df1+df2)+df3)+df4. As I hinted above, pandas uses numexpr under the hood for certain types of ops (in 0.11), e.g. df1 + df2 is evaluated that way; however, the expression you are giving here results in several separate calls to numexpr (which is why method 2 is faster than method 1). Using the direct ne.evaluate(...) call (method 3) achieves an even bigger speedup.
Note that in pandas 0.13 (0.12 will be released this week), we have implemented a function pd.eval which will in effect do exactly what my example above does. Stay tuned (if you are adventurous, this will be in master somewhat soon: https://github.com/pydata/pandas/pull/4037).
In [5]: %timeit pd.eval('df1+df2+df3+df4')
10 loops, best of 3: 50.9 ms per loop
Lastly, to answer your question: Cython will not help here at all; numexpr is quite efficient at this type of problem (that said, there are situations where Cython is helpful).
One caveat: in order to use the direct numexpr method, the frames must already be aligned (numexpr operates on the underlying numpy arrays and knows nothing about the indices). They should also share a single dtype.
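To illustrate that caveat, here is a minimal sketch (the frames are made up for the example) that aligns two frames with DataFrame.align and checks dtypes before handing the raw arrays to numexpr:

```python
import numpy as np
import pandas as pd
import numexpr as ne

# Two hypothetical frames with a shared dtype.
df1 = pd.DataFrame(np.random.rand(100, 10).astype('float32'))
df2 = pd.DataFrame(np.random.rand(100, 10).astype('float32'))

# align() guarantees both frames share the same index and columns;
# numexpr only ever sees the underlying numpy arrays.
df1, df2 = df1.align(df2)

a, b = df1.values, df2.values   # same shape, same dtype
assert a.dtype == b.dtype       # mixed dtypes would break this approach

# ne.evaluate picks up local variables by name and evaluates in one pass.
result = pd.DataFrame(ne.evaluate('a + b'),
                      index=df1.index, columns=df1.columns)
```

If the frames were not aligned, the plain pandas `df1 + df2` would reindex and introduce NaNs, while the raw-array version would silently add mismatched rows, so the alignment step matters.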