Why is using dplyr pipe (%>%) slower than an equivalent non-pipe expression, for high-cardinality group-by?

孤城傲影 2020-12-24 14:20

I thought that generally speaking using %>% wouldn't have a noticeable effect on speed. But in this case it runs 4x slower.

library(dplyr)
library(microbenchmark)

set.seed(0)
dummy_data <- data_frame(id = floor(runif(100000, 1, 100000)), label = floor(runif(100000, 1, 4)))

microbenchmark(dummy_data %>% group_by(id) %>% summarise(list(unique(label))))
microbenchmark(dummy_data %>% group_by(id) %>% summarise(label %>% unique %>% list))

4 Answers
  •  离开以前
    2020-12-24 15:06

    So, I finally got around to running the expressions in OP's question:

    set.seed(0)
    dummy_data <- dplyr::data_frame(
      id=floor(runif(100000, 1, 100000))
      , label=floor(runif(100000, 1, 4))
    )
    
    microbenchmark(dummy_data %>% group_by(id) %>% summarise(list(unique(label))))
    microbenchmark(dummy_data %>% group_by(id) %>% summarise(label %>% unique %>% list))
    

    This took so long that I thought I'd run into a bug, and force-interrupted R.

    Trying again, with the number of repetitions cut down, I got the following times:

    microbenchmark(
        b=dummy_data %>% group_by(id) %>% summarise(list(unique(label))),
        d=dummy_data %>% group_by(id) %>% summarise(label %>% unique %>% list),
        times=2)
    
    #Unit: seconds
    # expr      min       lq     mean   median       uq      max neval
    #    b 2.091957 2.091957 2.162222 2.162222 2.232486 2.232486     2
    #    d 7.380610 7.380610 7.459041 7.459041 7.537471 7.537471     2
    

    The times are in seconds! So much for milliseconds or microseconds. No wonder it seemed like R had hung at first, with the default value of times=100.

    But why is it taking so long? First, because of the way the dataset is constructed, the id column contains about 63,000 distinct values:

    length(unique(dummy_data$id))
    #[1] 63052
    

    Second, the expression being summarised itself contains several pipes, and each group of data is going to be relatively small.
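
    With about 63,000 groups spread over 100,000 rows, the average group holds well under two rows. A quick sanity check (my own sketch, reusing the dummy_data defined above):

    nrow(dummy_data) / length(unique(dummy_data$id))
    # roughly 1.6 rows per id on average
    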

    This is essentially the worst-case scenario for a piped expression: it's being called very many times, and each time, it's operating over a very small set of inputs. This results in plenty of overhead, and not much computation for that overhead to be amortised over.
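
    To see that per-call overhead in isolation, here is a small sketch (not from the original answer) that times a direct call against the equivalent pipe on a tiny vector, which is roughly what summarise() does once per group:

    x <- c(1, 2, 2)
    microbenchmark(
        direct = list(unique(x)),
        piped  = x %>% unique %>% list
    )
    # Expect the piped version to show a higher per-call time: the pipe has to
    # build and evaluate extra function calls on every invocation, and with
    # ~63000 groups that fixed cost is paid ~63000 times.
    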

    By contrast, if we just switch the variables that are being grouped and summarised over:

    microbenchmark(
        b=dummy_data %>% group_by(label) %>% summarise(list(unique(id))),
        d=dummy_data %>% group_by(label) %>% summarise(id %>% unique %>% list),
        times=2)
    
    #Unit: milliseconds
    # expr      min       lq     mean   median       uq      max neval
    #    b 12.00079 12.00079 12.04227 12.04227 12.08375 12.08375     2
    #    d 10.16612 10.16612 12.68642 12.68642 15.20672 15.20672     2
    

    Now everything looks much more equal.
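
    That is because label has only three distinct values (floor(runif(n, 1, 4)) can only produce 1, 2 or 3), so the inner pipes now run just three times instead of ~63,000, and their fixed per-call cost disappears into the noise. A quick check, assuming the same dummy_data:

    length(unique(dummy_data$label))
    # should print 3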
