How do I perform a semi-join with data.table? A semi-join is like an inner join except that it only returns the columns of X (not also those of Y), and does not repeat the r
I tried to write a method that doesn't use any names, which are downright confusing in the OP's example.
sJ <- function(x,y){
ycols <- 1:min(ncol(y),length(key(x)))
yjoin <- unique(y[, ..ycols])
yjoin
}
x[eval(sJ(x,y))]
For Victor's simpler example, this gives the desired output:
x y
1: 1 a
2: 3 c
3: 5 e
This is a ~30% slower than Victor's way.
EDIT: And Victor's approach, taking unique before joining, is quite a bit faster:
N <- 1e5*26
x <- data.table(x = 1:N, y = letters, z = rnorm(N))
setkey(x, x)
y <- data.table(x = sample(N, N/10, replace = TRUE), z = sample(letters, N/10, replace = TRUE))
setkey(y, x)
require(microbenchmark)
microbenchmark(
sJ=x[eval(sJ(x,y))],
dolla=unique(x[eval(y$x)]),
brack=x[eval(unique(y[['x']]))]
)
Unit: milliseconds
expr min lq median uq max neval
# sJ 120.22700 125.04900 126.50704 132.35326 217.6566 100
# dolla 105.05373 108.33804 109.16249 118.17613 285.9814 100
# brack 53.95656 61.32669 61.88227 65.21571 235.8048 100
I'm guessing the [[ vs $ doesn't help the speed, but didn't check.