Problem
There are some answers on SO where timings are compared without checking the results. However, I prefer to see whether an expression is correct and fast.
The microbenchmark package supports this with the check parameter. Unfortunately, the check fails on expressions which change a data.table by reference, i.e., the check does not recognize that results are different.
Case 1: data.table expressions where check works as expected
library(data.table)
library(microbenchmark)
# minimal data.table 1 col, 3 rows
dt <- data.table(x = c(1, 1, 10))
# define check function as in example section of help(microbenchmark)
my_check <- function(values) {
  all(sapply(values[-1], function(x) identical(values[[1]], x)))
}
The benchmark cases are designed to return different results. Thus,
microbenchmark(
  f1 = dt[, mean(x)],
  f2 = dt[, median(x)],
  check = my_check
)
returns an error message as expected:
Error: Input expressions are not equivalent.
Case 2: data.table expressions where check fails
Now, the expressions are modified to change dt by reference. Please note that the same check function is used.
microbenchmark(
  f1 = dt[, y := mean(x)],
  f2 = dt[, y := median(x)],
  check = my_check
)
now returns
 expr     min      lq     mean   median       uq     max neval cld
   f1 576.947 625.174 642.9820 640.7110 661.1870 732.391   100  a
   f2 602.022 658.384 684.7076 678.9975 694.0825 978.600   100   b
So, the check did not detect the difference here, although the two expressions produce different results. (Timings are irrelevant.)
I understand that the check is bound to fail because dt is changed by reference. When the results of the expressions are compared, each of them references the same object, in the state left by the last change.
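The aliasing described above can be reproduced outside of microbenchmark. The following is only a minimal sketch: the names r1 and r2 are introduced here for illustration and do not appear in the original post.

```r
library(data.table)

dt <- data.table(x = c(1, 1, 10))

# `:=` modifies dt in place and returns that same data.table,
# so r1 and r2 (illustrative names) both end up pointing at dt
r1 <- dt[, y := mean(x)]
r2 <- dt[, y := median(x)]

identical(r1, r2)  # TRUE, although mean(x) is 4 and median(x) is 1
r1$y               # 1 1 1, i.e. the state left by the last expression
```

Because both results are the same object, a check based on identical() has nothing to tell apart.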
Question
How can I modify the check function and/or the expressions so that the check reliably detects differing results even when a data.table is changed by reference?
Answer 1:
The simplest way is to use copy():
microbenchmark(
  f1 = copy(dt)[, y := mean(x)],
  f2 = copy(dt)[, y := median(x)],
  check = my_check, times = 1L
)
# Error: Input expressions are not equivalent.
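Why copy() makes the check work can be sketched without microbenchmark. This is an illustration under the same toy data, with r1 and r2 as hypothetical names: each expression now modifies its own copy, so the two results are separate objects and dt itself stays untouched.

```r
library(data.table)

dt <- data.table(x = c(1, 1, 10))

# each expression works on a private deep copy of dt
r1 <- copy(dt)[, y := mean(x)]
r2 <- copy(dt)[, y := median(x)]

identical(r1, r2)  # FALSE: y is 4 in r1 and 1 in r2
names(dt)          # "x": the original was not modified
```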
Adding copy(dt) to the mix gives an idea of the time spent on copying (and if necessary, one could always subtract that from the runtimes for f1 and f2).
microbenchmark(
  f1 = copy(dt)[, y := mean(x)],
  f2 = copy(dt)[, y := median(x)],
  f3 = copy(dt),
  times = 10L
)
# Unit: microseconds
#  expr     min      lq     mean   median      uq     max neval cld
#    f1 298.690 306.508 331.6364 315.1400 347.788 414.264    10   b
#    f2 319.075 322.475 373.3873 329.3895 336.268 746.134    10   b
#    f3  19.180  19.750  28.3504  25.1745  26.111  70.016    10  a
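The subtraction mentioned above could be done roughly as follows. This is a sketch, not part of the original answer: it assumes that comparing medians is good enough, and the names res, s, med and net are made up for the example. Timings will differ from run to run and between machines.

```r
library(data.table)
library(microbenchmark)

dt <- data.table(x = c(1, 1, 10))

res <- microbenchmark(
  f1 = copy(dt)[, y := mean(x)],
  f2 = copy(dt)[, y := median(x)],
  f3 = copy(dt),
  times = 10L
)

# summary() gives one row per expression, including a median column
s <- summary(res)
med <- setNames(s$median, as.character(s$expr))

# rough net medians of f1 and f2 with the copy() baseline (f3) removed
net <- med[c("f1", "f2")] - med["f3"]
net
```

With so few repetitions the difference is noisy, so it should be read as an estimate of the copying overhead rather than an exact correction.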
Source: https://stackoverflow.com/questions/38716772/check-of-microbenchmark-results-fails-with-data-table-changed-by-reference