Why is using dplyr pipe (%>%) slower than an equivalent non-pipe expression, for high-cardinality group-by?

Asked by 孤城傲影 on 2020-12-24 14:20 · 4 answers · 1,039 views

I thought that generally speaking using %>% wouldn't have a noticeable effect on speed. But in this case it runs 4x slower.

library(dplyr)
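
The question's code is cut off above; a minimal setup consistent with the benchmark in the answer below might look like the following (the seed and sizes are assumptions, only dummy_data, id, and label come from the original):

library(dplyr)
library(microbenchmark)

# hypothetical reconstruction: high-cardinality id, only a few label values
set.seed(0)
dummy_data <- tibble(
  id    = floor(runif(10000, 1, 10000)),
  label = floor(runif(10000, 1, 4))
)

# the pipe is inside summarise(), so it runs once per group
microbenchmark(
  nopipe = dummy_data %>% group_by(id) %>% summarise(label = list(unique(label))),
  pipe   = dummy_data %>% group_by(id) %>% summarise(label = label %>% unique %>% list),
  times = 10
)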


        
4 Answers
  •  清歌不尽 · 2020-12-24 15:08

    magrittr's pipe is built around the concept of a functional chain.

    You can create one by starting with a dot: . %>% head() %>% dim() is a compact way of writing a function.

    When using a standard pipe call such as iris %>% head() %>% dim(), the functional chain . %>% head() %>% dim() will still be computed first, causing an overhead.

    The functional chain is a bit of a strange animal:

    (. %>% head()) %>% dim
    #> NULL
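
    To confirm that such a chain really is just a function, you can assign it and call it (a minimal illustration):

    library(magrittr)
    f <- . %>% head() %>% dim()
    f(iris)  # same as dim(head(iris))
    #> [1] 6 5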
    

    When you look at the call . %>% head() %>% dim(), it actually parses as `%>%`(`%>%`(., head()), dim()). Basically, sorting things out requires some manipulation that takes a bit of time.
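
    You can see this nesting by quoting the call and listing its parts:

    as.list(quote(. %>% head() %>% dim()))
    #> [[1]]
    #> `%>%`
    #>
    #> [[2]]
    #> . %>% head()
    #>
    #> [[3]]
    #> dim()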

    Another thing that takes a bit of time is handling the different forms the rhs can take, such as iris %>% head, iris %>% head(.), iris %>% {head(.)}, etc., so that a dot can be inserted in the right place when relevant.
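
    All of these spellings give the same result, which is exactly what forces the pipe to do this sorting at run time:

    library(magrittr)
    identical(iris %>% head, head(iris))      # bare function name
    #> [1] TRUE
    identical(iris %>% head(.), head(iris))   # explicit dot
    #> [1] TRUE
    identical(iris %>% {head(.)}, head(iris)) # braced expression, no dot insertion
    #> [1] TRUE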

    You can build a very fast pipe the following way:

    `%.%` <- function(lhs, rhs) {
        rhs_call <- substitute(rhs)  # capture the rhs expression unevaluated
        # evaluate it with `.` bound to the lhs value
        eval(rhs_call, envir = list(. = lhs), enclos = parent.frame())
    }
    

    It will be much faster than magrittr's pipe and will actually behave better in edge cases, but it requires explicit dots and obviously won't support functional chains.
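
    For instance, with %.% as defined above, omitting the dot no longer does what a magrittr user might expect (a quick illustration):

    1 %.% identity(.)  # explicit dot: the call is evaluated with . = 1
    #> [1] 1
    1 %.% identity     # bare name: returns the identity function itself, not identity(1)

    The reprex below benchmarks it against magrittr's pipe: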

    library(magrittr)
    `%.%` <- function (lhs, rhs) {
      rhs_call <- substitute(rhs)
      eval(rhs_call, envir = list(. = lhs), enclos = parent.frame())
    }
    bench::mark(relative = T,
      "%>%" =
        1 %>% identity %>% identity() %>% (identity) %>% {identity(.)},
      "%.%" = 
        1 %.% identity(.) %.% identity(.) %.% identity(.) %.% identity(.)
    )
    #> # A tibble: 2 x 6
    #>   expression   min median `itr/sec` mem_alloc `gc/sec`
    #>   <bch:expr> <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
    #> 1 %>%         15.9   13.3       1        4.75     1   
    #> 2 %.%          1      1        17.0      1        1.60
    

    Created on 2019-10-05 by the reprex package (v0.3.0)

    Here it was clocked at about 13 times faster.

    I included it in my experimental fastpipe package, where it is named %>>%.
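
    Usage through fastpipe should look something like this (assuming the package is installed from CRAN; a sketch of the explicit-dot style shown above, not a full tour of its API):

    # install.packages("fastpipe")  # assumed available on CRAN
    library(fastpipe)
    1 %>>% identity(.)
    #> [1] 1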

    Now, we can also leverage the power of functional chains directly, with a simple change to your call:

    dummy_data %>% group_by(id) %>% summarise_at('label', . %>% unique %>% list)
    

    It will be much faster because the functional chain is only parsed once and then internally it just applies functions one after another in a loop, very close to your base solution. My fast pipe on the other hand still adds a small overhead due to the eval / substitute done for every loop instance and every pipe.
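
    You can peek at that structure with magrittr's functions() helper, which extracts the chain's steps as a plain list of functions (a small sketch):

    library(magrittr)
    f <- . %>% unique %>% list
    functions(f)  # a list of two functions, applied one after the other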

    Here's a benchmark including those two new solutions:

    microbenchmark::microbenchmark(
      nopipe=dummy_data %>% group_by(id) %>% summarise(label = list(unique(label))),
      magrittr=dummy_data %>% group_by(id) %>% summarise(label = label %>% unique %>% list),
      functional_chain=dummy_data %>% group_by(id) %>% summarise_at('label', . %>% unique %>% list),
      fastpipe=dummy_data %.% group_by(., id) %.% summarise(., label = label %.% unique(.) %.% list(.)),
      times = 10
    )
    
    #> Unit: milliseconds
    #>              expr      min       lq     mean    median       uq      max neval cld
    #>            nopipe  42.2388  42.9189  58.0272  56.34325  66.1304  80.5491    10  a 
    #>          magrittr 512.5352 571.9309 625.5392 616.60310 670.3800 811.1078    10   b
    #>  functional_chain  64.3320  78.1957 101.0012  99.73850 126.6302 148.7871    10  a 
    #>          fastpipe  66.0634  87.0410 101.9038  98.16985 112.7027 172.1843    10  a
    
