Why is using dplyr pipe (%>%) slower than an equivalent non-pipe expression, for high-cardinality group-by?

孤城傲影 2020-12-24 14:20

I thought that, generally speaking, using %>% wouldn't have a noticeable effect on speed. But in this case it runs 4x slower.

library(dplyr)
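
(The rest of the original snippet is cut off here. A minimal sketch of the kind of comparison the title describes, using made-up data and column names, might look like this:)

    library(dplyr)
    library(microbenchmark)

    # made-up example data: ~10,000 rows with a high-cardinality grouping column
    set.seed(0)
    d <- tibble(id    = sample(10000, 10000, replace = TRUE),
                label = sample(3, 10000, replace = TRUE))

    microbenchmark(
      pipe    = d %>% group_by(id) %>% summarise(u = label %>% unique %>% list),
      no_pipe = summarise(group_by(d, id), u = list(unique(label))),
      times = 20
    )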


        
4 Answers
  •  误落风尘
    2020-12-24 14:50

    What might be a negligible effect in a real-world full application becomes non-negligible when writing one-liners that are time-dependent on the formerly "negligible". I suspect that if you profile your tests, most of the time will be in the summarize clause, so let's microbenchmark something similar to that:

    > set.seed(99);z=sample(10000,4,TRUE)
    > microbenchmark(z %>% unique %>% list, list(unique(z)))
    Unit: microseconds
                      expr     min      lq      mean   median      uq     max neval
     z %>% unique %>% list 142.617 144.433 148.06515 145.0265 145.969 297.735   100
           list(unique(z))   9.289   9.988  10.85705  10.5820  11.804  12.642   100
    

    This is doing something a bit different to your code but illustrates the point. Pipes are slower.

    Because the pipe has to restructure the call into the same form that a plain function evaluation would use, and then evaluate it, it has to be slower. By how much depends on how speedy the functions are. Calls to unique and list are pretty fast in R, so the whole difference here is the pipe overhead.
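
    (To get a feel for where that overhead comes from, here is a rough sketch: it compares a direct call with the same call first built as a call object and then evaluated, which loosely mimics the extra work the pipe does. This is an analogy, not magrittr's actual implementation.)

    library(magrittr)
    library(microbenchmark)

    set.seed(99); z <- sample(10000, 4, TRUE)

    microbenchmark(
      direct      = unique(z),                      # plain function call
      constructed = eval(call("unique", quote(z))), # build the call, then evaluate it
      piped       = z %>% unique                    # the pipe, for comparison
    )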

    Profiling expressions like this showed me most of the time is spent in the pipe functions:

                             total.time total.pct self.time self.pct
    "microbenchmark"              16.84     98.71      1.22     7.15
    "%>%"                         15.50     90.86      1.22     7.15
    "eval"                         5.72     33.53      1.18     6.92
    "split_chain"                  5.60     32.83      1.92    11.25
    "lapply"                       5.00     29.31      0.62     3.63
    "FUN"                          4.30     25.21      0.24     1.41
     ..... stuff .....
    

    then somewhere down in about 15th place the real work gets done:

    "as.list"                      1.40      8.13      0.66     3.83
    "unique"                       1.38      8.01      0.88     5.11
    "rev"                          1.26      7.32      0.90     5.23
    

    Whereas if you just call the functions as Chambers intended, R gets straight down to it:

                             total.time total.pct self.time self.pct
    "microbenchmark"               2.30     96.64      1.04    43.70
    "unique"                       1.12     47.06      0.38    15.97
    "unique.default"               0.74     31.09      0.64    26.89
    "is.factor"                    0.10      4.20      0.10     4.20
    

    Hence the oft-quoted recommendation that pipes are okay on the command line where your brain thinks in chains, but not in functions that might be time-critical. In practice this overhead will probably get wiped out in one call to glm with a few hundred data points, but that's another story....
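
    (A quick way to sanity-check that last claim; the data set and model below are made up:)

    library(dplyr)
    library(microbenchmark)

    d <- data.frame(y = rbinom(500, 1, 0.5), x = rnorm(500))

    microbenchmark(
      piped  = d %>% glm(y ~ x, family = binomial, data = .),  # pipe overhead included
      direct = glm(y ~ x, family = binomial, data = d)         # plain call
    )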
