I thought that generally speaking using %>% wouldn't have a noticeable effect on speed. But in this case it runs 4x slower.
library(dplyr)
What might be a negligible effect in a real-world full application becomes non-negligible when you write one-liners whose run time is dominated by the formerly "negligible" part. I suspect that if you profile your tests, most of the time will be in the summarize clause, so let's microbenchmark something similar to that:
> set.seed(99);z=sample(10000,4,TRUE)
> microbenchmark(z %>% unique %>% list, list(unique(z)))
Unit: microseconds
                  expr     min      lq      mean   median      uq     max neval
 z %>% unique %>% list 142.617 144.433 148.06515 145.0265 145.969 297.735   100
       list(unique(z))   9.289   9.988  10.85705  10.5820  11.804  12.642   100
This is doing something a bit different to your code but illustrates the point. Pipes are slower.
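To put rough numbers on that output, here is a back-of-envelope calculation using the two medians. The 10 ms figure is a made-up cost for a "real" step, chosen only for illustration; it is not a measurement of anything in your code.

```r
# Medians (microseconds) copied from the microbenchmark output above.
median_piped  <- 145.0265  # z %>% unique %>% list
median_direct <- 10.5820   # list(unique(z))

# Extra cost of the two-pipe chain, and the slowdown factor.
overhead_us <- median_piped - median_direct
slowdown    <- median_piped / median_direct

# Hypothetical: the same overhead next to a step that takes 10 ms
# (an assumed figure for illustration, not a measured timing).
assumed_work_us <- 10000
overhead_share  <- overhead_us / (assumed_work_us + overhead_us)

round(overhead_us, 1)           # 134.4
round(slowdown, 1)              # 13.7
round(100 * overhead_share, 1)  # 1.3
```

So the chain costs roughly 134 microseconds extra here, which is a 13.7x slowdown for a trivial expression but only about 1.3% of a step that takes 10 ms.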
Pipes are slower because the pipe has to restructure the expression into the ordinary nested function call that R would otherwise evaluate directly, and then evaluate it. So it has to be slower. By how much depends on how fast the functions are. Calls to unique and list are pretty fast in R, so almost the whole difference here is pipe overhead.
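To make that restructuring concrete, here is a toy pipe written in base R. It is only a sketch of the general idea: `%then%` is a hypothetical operator invented for this example, and it is not how magrittr implements `%>%`. The point is that every use of the operator has to capture the right-hand side, build a new call, and evaluate it, all in interpreted R code, before the real function runs.

```r
# Hypothetical toy pipe (NOT magrittr's implementation): rewrite
# `lhs %then% f(a, b)` into `f(lhs, a, b)` and evaluate the result.
`%then%` <- function(lhs, rhs) {
  rhs_expr <- substitute(rhs)  # capture the right-hand side unevaluated
  if (is.call(rhs_expr)) {
    # f(a, b) becomes f(lhs, a, b)
    parts <- c(list(rhs_expr[[1]], quote(lhs)), as.list(rhs_expr)[-1])
  } else {
    # a bare function name f becomes f(lhs)
    parts <- list(rhs_expr, quote(lhs))
  }
  new_call <- as.call(parts)
  # Evaluate the rebuilt call with `lhs` bound to the piped value;
  # everything else is looked up in the caller's environment.
  eval(new_call, envir = list(lhs = lhs), enclos = parent.frame())
}

set.seed(99)
z <- sample(10000, 4, TRUE)

piped  <- z %then% unique %then% list
direct <- list(unique(z))
identical(piped, direct)
```

Both forms give the same result, but the piped one goes through `substitute`, `as.call` and `eval` twice before `unique` and `list` do any work.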
Profiling expressions like this showed me most of the time is spent in the pipe functions:
                 total.time total.pct self.time self.pct
"microbenchmark"      16.84     98.71      1.22     7.15
"%>%"                 15.50     90.86      1.22     7.15
"eval"                 5.72     33.53      1.18     6.92
"split_chain"          5.60     32.83      1.92    11.25
"lapply"               5.00     29.31      0.62     3.63
"FUN"                  4.30     25.21      0.24     1.41
..... stuff .....
then somewhere down in about 15th place the real work gets done:
"as.list"              1.40      8.13      0.66     3.83
"unique"               1.38      8.01      0.88     5.11
"rev"                  1.26      7.32      0.90     5.23
Whereas if you just call the functions as Chambers intended, R gets straight down to it:
                 total.time total.pct self.time self.pct
"microbenchmark"       2.30     96.64      1.04    43.70
"unique"               1.12     47.06      0.38    15.97
"unique.default"       0.74     31.09      0.64    26.89
"is.factor"            0.10      4.20      0.10     4.20
Hence the oft-quoted recommendation that pipes are okay on the command line where your brain thinks in chains, but not in functions that might be time-critical. In practice this overhead will probably get wiped out in one call to glm with a few hundred data points, but that's another story....
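If you want to check where the time goes in your own code, the profile tables above have the same columns as the by.total component of summaryRprof() output (that is an inference from the column names; the exact commands used to produce them are not shown here). A minimal sketch using base R's sampling profiler follows. profile_for() is a hypothetical helper written for this example, and the one-second duration is an arbitrary choice.

```r
# Hypothetical helper: evaluate `expr` repeatedly for about `seconds`
# seconds under Rprof() and return the by.total summary table.
profile_for <- function(expr, seconds = 1) {
  expr <- substitute(expr)        # capture the expression unevaluated
  env  <- parent.frame()          # evaluate it where the caller's variables live
  out  <- tempfile(fileext = ".out")
  on.exit(unlink(out))            # remove the profile file when done
  Rprof(out, interval = 0.01)     # start sampling the call stack
  deadline <- proc.time()[["elapsed"]] + seconds
  while (proc.time()[["elapsed"]] < deadline) eval(expr, env)
  Rprof(NULL)                     # stop sampling
  summaryRprof(out)$by.total
}

set.seed(99)
z <- sample(10000, 4, TRUE)
prof <- profile_for(list(unique(z)))
head(prof)
```

With magrittr loaded, the same helper can be pointed at the piped expression instead, and the two tables compared. The helper's own eval and proc.time calls will show up in the profile too, so read the relative ordering rather than the absolute numbers.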