"I thought that generally speaking using %>% wouldn't have a noticeable effect on speed. But in this case it runs 4x slower."

So, I finally got around to running the expressions in the OP's question:
library(dplyr)
library(microbenchmark)
set.seed(0)
dummy_data <- dplyr::data_frame(
  id    = floor(runif(100000, 1, 100000)),
  label = floor(runif(100000, 1, 4))
)
microbenchmark(dummy_data %>% group_by(id) %>% summarise(list(unique(label))))
microbenchmark(dummy_data %>% group_by(id) %>% summarise(label %>% unique %>% list))
This took so long that I thought I'd run into a bug, and force-interrupted R.
Trying again, with the number of repetitions cut down, I got the following times:
microbenchmark(
  b = dummy_data %>% group_by(id) %>% summarise(list(unique(label))),
  d = dummy_data %>% group_by(id) %>% summarise(label %>% unique %>% list),
  times = 2)
#Unit: seconds
# expr min lq mean median uq max neval
# b 2.091957 2.091957 2.162222 2.162222 2.232486 2.232486 2
# d 7.380610 7.380610 7.459041 7.459041 7.537471 7.537471 2
The times are in seconds! So much for milliseconds or microseconds. No wonder it seemed like R had hung at first, with the default value of times = 100.
But why is it taking so long? First, because of the way the dataset is constructed, the id
column contains about 63,000 distinct values:
length(unique(dummy_data$id))
#[1] 63052
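A quick sanity check makes the consequence concrete: with that many distinct ids over 100,000 rows, the average group handed to summarise() is tiny. (A sketch reproducing just the id column with the same seed; group counts here depend on that seed.)

```r
set.seed(0)
id <- floor(runif(100000, 1, 100000))

# With ~63,000 distinct ids across 100,000 rows, the average
# group that summarise() sees holds fewer than two rows.
mean(table(id))
# roughly 100000 / 63052, i.e. about 1.6 rows per group
```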
Second, the expression being summarised over itself contains several pipes, and each group of data it runs on is relatively small.
This is essentially the worst-case scenario for a piped expression: it's being called very many times, and each time, it's operating over a very small set of inputs. This results in plenty of overhead, and not much computation for that overhead to be amortised over.
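To isolate that per-call cost, one can benchmark a single piped call against its plain equivalent on a tiny vector, roughly what each of the ~63,000 groups looks like. This is a sketch, not a definitive measurement: the exact overhead depends on your magrittr version (magrittr 2.0 reimplemented the pipe and reduced its cost considerably).

```r
library(magrittr)
library(microbenchmark)

x <- c(1, 2, 2, 3)  # a tiny input, like one of the ~63,000 groups

# Both forms compute the same thing, but the piped form pays a
# fixed setup cost on every call. Multiply that by ~63,000 group
# evaluations and it swamps the actual computation.
microbenchmark(
  plain = list(unique(x)),
  piped = x %>% unique %>% list
)
```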
By contrast, if we just switch the variables that are being grouped and summarized over:
microbenchmark(
  b = dummy_data %>% group_by(label) %>% summarise(list(unique(id))),
  d = dummy_data %>% group_by(label) %>% summarise(id %>% unique %>% list),
  times = 2)
#Unit: milliseconds
# expr min lq mean median uq max neval
# b 12.00079 12.00079 12.04227 12.04227 12.08375 12.08375 2
# d 10.16612 10.16612 12.68642 12.68642 15.20672 15.20672 2
Now the two expressions are much more evenly matched: the pipe's overhead is paid only once per group, and with so few groups it disappears into the noise.
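The group count explains the difference. Because label was drawn with floor(runif(n, 1, 4)), it can only take the values 1, 2, and 3, so the piped summary expression is now evaluated just three times instead of ~63,000. A quick check (reproducing just the label column with the same seed):

```r
set.seed(0)
label <- floor(runif(100000, 1, 4))

# floor() of a draw from [1, 4) can only yield 1, 2, or 3,
# so grouping by label produces exactly three groups.
length(unique(label))
#> 3
```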