Why is using dplyr pipe (%>%) slower than an equivalent non-pipe expression, for high-cardinality group-by?

Asked by 孤城傲影 on 2020-12-24 14:20 · 4 answers · 1,039 views

I thought that generally speaking using %>% wouldn't have a noticeable effect on speed. But in this case it runs 4x slower.

library(dplyr)
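
The question's code is cut off above; a minimal setup consistent with the benchmark in the answer below might look like the following (the seed and sizes are assumptions, only dummy_data, id, and label come from the original):

library(dplyr)
library(microbenchmark)

# hypothetical reconstruction: high-cardinality id, only a few label values
set.seed(0)
dummy_data <- tibble(
  id    = floor(runif(10000, 1, 10000)),
  label = floor(runif(10000, 1, 4))
)

# the pipe is inside summarise(), so it runs once per group
microbenchmark(
  nopipe = dummy_data %>% group_by(id) %>% summarise(label = list(unique(label))),
  pipe   = dummy_data %>% group_by(id) %>% summarise(label = label %>% unique %>% list),
  times = 10
)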


        
4 Answers
  •  清歌不尽 · 2020-12-24 15:08

    magrittr's pipe is built around the concept of a functional chain.

    You can create one by starting with a dot: . %>% head() %>% dim() is a compact way of writing a function.

    When using a standard pipe call such as iris %>% head() %>% dim(), the functional chain . %>% head() %>% dim() will still be computed first, causing an overhead.

    The functional chain is a bit of a strange animal:

    (. %>% head()) %>% dim
    #> NULL
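
    To confirm that such a chain really is just a function, you can assign it and call it (a minimal illustration):

    library(magrittr)
    f <- . %>% head() %>% dim()
    f(iris)  # same as dim(head(iris))
    #> [1] 6 5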
    

    When you look at the call . %>% head() %>% dim(), it actually parses as `%>%`(`%>%`(., head()), dim()). Basically, sorting things out requires some manipulation that takes a bit of time.
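
    You can see this nesting by quoting the call and listing its parts:

    as.list(quote(. %>% head() %>% dim()))
    #> [[1]]
    #> `%>%`
    #>
    #> [[2]]
    #> . %>% head()
    #>
    #> [[3]]
    #> dim()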

    Another thing that takes a bit of time is handling the different forms the rhs can take, such as iris %>% head, iris %>% head(.), iris %>% {head(.)}, etc., so that a dot can be inserted in the right place when relevant.
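
    All of these spellings give the same result, which is exactly what forces the pipe to do this sorting at run time:

    library(magrittr)
    identical(iris %>% head, head(iris))      # bare function name
    #> [1] TRUE
    identical(iris %>% head(.), head(iris))   # explicit dot
    #> [1] TRUE
    identical(iris %>% {head(.)}, head(iris)) # braced expression, no dot insertion
    #> [1] TRUE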

    You can build a very fast pipe the following way:

    `%.%` <- function(lhs, rhs) {
        rhs_call <- substitute(rhs)  # capture the rhs expression unevaluated
        # evaluate it with `.` bound to the lhs value
        eval(rhs_call, envir = list(. = lhs), enclos = parent.frame())
    }
    

    It will be much faster than magrittr's pipe and will actually behave better in edge cases, but it requires explicit dots and obviously won't support functional chains.
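
    For instance, with %.% as defined above, omitting the dot no longer does what a magrittr user might expect (a quick illustration):

    1 %.% identity(.)  # explicit dot: the call is evaluated with . = 1
    #> [1] 1
    1 %.% identity     # bare name: returns the identity function itself, not identity(1)

    The reprex below benchmarks it against magrittr's pipe: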

    library(magrittr)
    `%.%` <- function (lhs, rhs) {
      rhs_call <- substitute(rhs)
      eval(rhs_call, envir = list(. = lhs), enclos = parent.frame())
    }
    bench::mark(relative = T,
      "%>%" =
        1 %>% identity %>% identity() %>% (identity) %>% {identity(.)},
      "%.%" = 
        1 %.% identity(.) %.% identity(.) %.% identity(.) %.% identity(.)
    )
    #> # A tibble: 2 x 6
    #>   expression   min median `itr/sec` mem_alloc `gc/sec`
    #>   <bch:expr> <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
    #> 1 %>%         15.9   13.3       1        4.75     1   
    #> 2 %.%          1      1        17.0      1        1.60
    

    Created on 2019-10-05 by the reprex package (v0.3.0)

    Here it was clocked at about 13 times faster.

    I included it in my experimental fastpipe package, where it is named %>>%.
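
    Usage through fastpipe should look something like this (assuming the package is installed from CRAN; a sketch of the explicit-dot style shown above, not a full tour of its API):

    # install.packages("fastpipe")  # assumed available on CRAN
    library(fastpipe)
    1 %>>% identity(.)
    #> [1] 1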

    Now, we can also leverage the power of functional chains directly, with a simple change to your call:

    dummy_data %>% group_by(id) %>% summarise_at('label', . %>% unique %>% list)
    

    It will be much faster because the functional chain is only parsed once and then internally it just applies functions one after another in a loop, very close to your base solution. My fast pipe on the other hand still adds a small overhead due to the eval / substitute done for every loop instance and every pipe.
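
    You can peek at that structure with magrittr's functions() helper, which extracts the chain's steps as a plain list of functions (a small sketch):

    library(magrittr)
    f <- . %>% unique %>% list
    functions(f)  # a list of two functions, applied one after the other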

    Here's a benchmark including those two new solutions:

    microbenchmark::microbenchmark(
      nopipe=dummy_data %>% group_by(id) %>% summarise(label = list(unique(label))),
      magrittr=dummy_data %>% group_by(id) %>% summarise(label = label %>% unique %>% list),
      functional_chain=dummy_data %>% group_by(id) %>% summarise_at('label', . %>% unique %>% list),
      fastpipe=dummy_data %.% group_by(., id) %.% summarise(., label = label %.% unique(.) %.% list(.)),
      times = 10
    )
    
    #> Unit: milliseconds
    #>              expr      min       lq     mean    median       uq      max neval cld
    #>            nopipe  42.2388  42.9189  58.0272  56.34325  66.1304  80.5491    10  a 
    #>          magrittr 512.5352 571.9309 625.5392 616.60310 670.3800 811.1078    10   b
    #>  functional_chain  64.3320  78.1957 101.0012  99.73850 126.6302 148.7871    10  a 
    #>          fastpipe  66.0634  87.0410 101.9038  98.16985 112.7027 172.1843    10  a
    
