Problem

There are some answers on Stack Overflow where timings are compared without checking the results. I prefer to verify that an expression is both correct and fast.

The microbenchmark package supports this with the check parameter. Unfortunately, the check fails on expressions which change a data.table by reference, i.e., the check does not recognize that results are different.

Case 1: data.table expressions where check works as expected

library(data.table)
library(microbenchmark)

# minimal data.table 1 col, 3 rows
dt <- data.table(x = c(1, 1, 10))

# define check function as in example section of help(microbenchmark)
my_check <- function(values) {
  all(sapply(values[-1], function(x) identical(values[[1]], x)))
}
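To make the behaviour of the check function concrete, here is a small sketch that calls it directly on hand-made lists. It assumes that microbenchmark passes the check function a list holding one result per expression; the list names f1 and f2 and the values are only illustrative.

```r
# Same check function as above
my_check <- function(values) {
  all(sapply(values[-1], function(x) identical(values[[1]], x)))
}

# Hand-made stand-ins for the list of results (illustrative values)
same_results <- list(f1 = 4, f2 = 4)
diff_results <- list(f1 = 4, f2 = 1)

my_check(same_results)  # TRUE: every result is identical to the first
my_check(diff_results)  # FALSE: the second result differs from the first
```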

The benchmark cases are designed to return different results. Thus,

microbenchmark(
  f1 = dt[, mean(x)],
  f2 = dt[, median(x)],
  check = my_check
)

returns an error message as expected:

Error: Input expressions are not equivalent.

Case 2: data.table expressions where check fails

Now the expressions are modified to change dt by reference. Please note that the same check function is used.

microbenchmark(
  f1 = dt[, y := mean(x)],
  f2 = dt[, y := median(x)],
  check = my_check
)

now returns

 expr     min      lq     mean   median       uq     max neval cld
   f1 576.947 625.174 642.9820 640.7110 661.1870 732.391   100  a 
   f2 602.022 658.384 684.7076 678.9975 694.0825 978.600   100   b

So, the check on the results has failed here although the two expressions are different. (Timings are irrelevant.)

I understand that the check is bound to fail because dt is changed by reference: when the results are compared, each expression's result refers to the same object, in the state left by the last change.
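The effect can be reproduced without microbenchmark. The following is a minimal sketch of that reasoning; r1 and r2 are illustrative names for the values the two expressions return.

```r
library(data.table)

dt <- data.table(x = c(1, 1, 10))

# := modifies dt in place and returns that same object (invisibly)
r1 <- dt[, y := mean(x)]
r2 <- dt[, y := median(x)]

# Both results are the same object as dt, so they cannot differ
address(r1) == address(dt)  # TRUE
address(r2) == address(dt)  # TRUE
identical(r1, r2)           # TRUE

# r1 shows the state after the last change: y holds the median (1), not the mean (4)
r1$y
```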

Question

How can I modify the check function and/or the expressions so that the check will reliably detect differing results, even when a data.table is changed by reference?

Answer 1:

The simplest way is to use copy():

microbenchmark(
    f1 = copy(dt)[, y := mean(x)],
    f2 = copy(dt)[, y := median(x)],
    check = my_check, times=1L
)
# Error: Input expressions are not equivalent.
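As a rough sketch of why this works: each expression now modifies its own copy, so the two results are separate objects and dt itself stays untouched. The names r1 and r2 are illustrative only.

```r
library(data.table)

dt <- data.table(x = c(1, 1, 10))

# Each expression works on a fresh deep copy of dt
r1 <- copy(dt)[, y := mean(x)]
r2 <- copy(dt)[, y := median(x)]

identical(r1, r2)     # FALSE: y is 4 in r1 and 1 in r2
"y" %in% names(dt)    # FALSE: the original dt was not modified
```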

Adding copy(dt) to the mix gives an idea of the time spent on copying (and, if necessary, one could always subtract that from the runtimes for f1 and f2).

microbenchmark(
    f1 = copy(dt)[, y := mean(x)],
    f2 = copy(dt)[, y := median(x)],
    f3 = copy(dt),
    times=10L
)
# Unit: microseconds
#  expr     min      lq     mean   median      uq     max neval cld
#    f1 298.690 306.508 331.6364 315.1400 347.788 414.264    10   b
#    f2 319.075 322.475 373.3873 329.3895 336.268 746.134    10   b
#    f3  19.180  19.750  28.3504  25.1745  26.111  70.016    10   a
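If the copy overhead should be subtracted, one possible sketch is to take the medians from summary() of the benchmark object. The names mb, s, med and net are illustrative, and the subtraction is only a rough approximation, since the timings of the individual runs are noisy.

```r
library(data.table)
library(microbenchmark)

dt <- data.table(x = c(1, 1, 10))

mb <- microbenchmark(
    f1 = copy(dt)[, y := mean(x)],
    f2 = copy(dt)[, y := median(x)],
    f3 = copy(dt),
    times = 10L
)

# summary() returns a data.frame with one row per expression
s <- summary(mb, unit = "us")
med <- setNames(s$median, as.character(s$expr))

# Approximate net medians (microseconds) after removing the cost of copy(dt)
net <- med[c("f1", "f2")] - med[["f3"]]
net
```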

Source: https://stackoverflow.com/questions/38716772/check-of-microbenchmark-results-fails-with-data-table-changed-by-reference

Tags

data.table

microbenchmark

Check of microbenchmark results fails with data.table changed by reference
