Pandas replace/dictionary slowness

前端未结

关注

 2  1984

Please help me understand why this \"replace from dictionary\" operation is slow in Python/Pandas:

# Series has 200 rows and 1 column
# Dictionary has 11269


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  不知归路        
                
              
                            
                2020-12-14 04:40
              
            
            
                                                                       
It looks like replace has a bit of overhead, and explicitly telling the Series what to do via map yields the best performance:

series = series.map(lambda x: dictionary.get(x,x))


If you're sure that all keys are in your dictionary you can get a very slight performance boost by not creating a lambda, and directly supplying the dictionary.get function.  Any keys that are not present will return NaN via this method, so beware:

series = series.map(dictionary.get)


You can also supply just the dictionary itself, but this appears to introduce a bit of overhead:

series = series.map(dictionary)


Timings

Some timing comparisons using your example data:

%timeit series.map(dictionary.get)
10000 loops, best of 3: 124 µs per loop

%timeit series.map(lambda x: dictionary.get(x,x))
10000 loops, best of 3: 150 µs per loop

%timeit series.map(dictionary)
100 loops, best of 3: 5.45 ms per loop

%timeit series.replace(dictionary)
1 loop, best of 3: 1.23 s per loop

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  爱一瞬间的悲伤        
                
              
                            
                2020-12-14 04:45
              
            
            
                                                                       
.replacecan do incomplete substring matches, while .map requires complete values to be supplied in the dictionary (or it returns NaNs). The fast but generic solution (that can handle substring) should first use .replace on a dict of all possible values (obtained e.g. with .value_counts().index) and then go over all rows of the Series with this dict and .map. This combo can handle for instance special national characters replacements (full substrings) on 1m-row columns in a quarter of a second, where .replace alone would take 15. 
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复