How do I perform a semi-join with data.table? A semi-join is like an inner join except that it only returns the columns of X (not also those of Y), and does not repeat the rows of X to match the rows of Y.
Update. Based on all the discussion here, I would do something like this, which should be fast and work in the most general case:
x[eval(unique(y[, key(x), with = FALSE]))]
Here is another, more direct solution:
unique(x[eval(y$x)])
It's more direct and runs faster - here is the comparison in run times with my previous solution:
# Generate some large data
N <- 1000000 * 26
x <- data.table(x = 1:N, y = letters, z = rnorm(N))
setkey(x, x)
y <- data.table(x = sample(N, N/10, replace = TRUE),  z = sample(letters, N/10, replace = TRUE))
setkey(y, x)
system.time(r1 <- x[!eval(x[!y])])
   user  system elapsed 
  7.772   1.217  11.998 
system.time(r2 <- unique(x[eval(y$x)]))
   user  system elapsed 
  0.540   0.142   0.723 
In a more general case, you can do something like
x[eval(y[, key(x), with = FALSE])]
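One caveat with the join-based forms: by default, a join in i keeps every row of i, so keys in y that have no match in x come back as rows filled with NA. A minimal sketch of the same idea that drops those rows; the tables `orders` and `lookup` and their columns are hypothetical, `on =` is used instead of keys, and `nomatch = NULL` assumes data.table 1.12.0 or later (older versions spell it `nomatch = 0L`):

```r
library(data.table)

# Hypothetical tables; names deliberately differ from the column names.
orders <- data.table(id = c(1L, 2L, 2L, 3L), item = c("a", "b", "c", "d"))
lookup <- data.table(id = c(2L, 2L, 4L), flag = c(TRUE, FALSE, TRUE))

# Default join: id 4 has no match in orders, so it appears with item = NA.
with_na <- orders[unique(lookup[, "id"]), on = "id"]

# Semi-join: de-duplicate the join column of lookup, then drop unmatched rows.
semi <- orders[unique(lookup[, "id"]), on = "id", nomatch = NULL]
```

Taking `unique()` of the join column first means rows of `orders` are not repeated, while duplicate rows already present in `orders` are kept.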
One solution I can think of is:
tmp <- x[!y]
x[!tmp]
In data.table, you can use another data.table as the i expression (i.e., the first argument in the `[.data.table` call), and that will perform a join, e.g.:
x <- data.table(x = 1:10, y = letters[1:10])
setkey(x, x)
y <- data.table(x = c(1,3,5,1), z = 1:4)
> x[y]
   x y z
1: 1 a 1
2: 3 c 2
3: 5 e 3
4: 1 a 4
The ! before the i expression is an extension of the syntax above that performs a 'not-join', as described on p. 11 of the data.table documentation. So the first assignment evaluates to the subset of x that has no rows whose key (column x) is present in y:
> x[!y]
    x y
1:  2 b
2:  4 d
3:  6 f
4:  7 g
5:  8 h
6:  9 i
7: 10 j
It is similar to setdiff in this regard. And therefore the second statement returns all the rows in x where the key is present in y.
The ! feature was added in data.table 1.8.4 with the following note in NEWS:
o A new "!" prefix on i signals 'not-join' (a.k.a. 'not-where'), #1384i.
      DT[-DT["a", which=TRUE, nomatch=0]]   # old not-join idiom, still works
      DT[!"a"]                              # same result, now preferred.
      DT[!J(6),...]                         # !J == not-join
      DT[!2:3,...]                          # ! on all types of i
      DT[colA!=6L | colB!=23L,...]          # multiple vector scanning approach (slow)
      DT[!J(6L,23L)]                        # same result, faster binary search
  '!' has been used rather than '-' :
      * to match the 'not-join'/'not-where' nomenclature
      * with '-', DT[-0] would return DT rather than DT[0] and not be backwards
        compatible. With '!', DT[!0] returns DT both before (since !0 is TRUE in
        base R) and after this new feature.
      * to leave DT[+J...] and DT[-J...] available for future use
For some reason, the following doesn't work: x[!(x[!y])]. Probably data.table is too smart about parsing the argument.
P.S. As Josh O'Brien pointed out in another answer, a one-liner would be x[!eval(x[!y])].
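One plausible reason the nested form fails (my reading, not something confirmed in this thread) is that, inside x[...], the names x and y also match columns of x, so the inner x[!y] is evaluated against columns rather than tables, whereas eval() is evaluated in the calling scope. A minimal sketch with hypothetical tables `orders` and `lookup`, whose names do not collide with any column, where the nested not-join works without eval():

```r
library(data.table)

# Hypothetical tables; neither table name is also a column name.
orders <- data.table(id = c(1L, 2L, 2L, 3L), item = c("a", "b", "c", "d"))
lookup <- data.table(id = c(2L, 2L, 4L), flag = c(TRUE, FALSE, TRUE))

# Two-step version: anti-join first, then anti-join against that result.
anti <- orders[!lookup, on = "id"]  # rows of orders with no match in lookup
semi <- orders[!anti, on = "id"]    # rows of orders that do have a match

# The same thing nested in one expression, without eval().
semi_nested <- orders[!orders[!lookup, on = "id"], on = "id"]
```

Because both steps are not-joins, no rows of `orders` are repeated and the columns of `lookup` never enter the result.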
This thread is so old. But I noticed that the solution can be easily derived from the definition of semi-join given in the original post:
"A semi-join is like an inner join except that it only returns the columns of X (not also those of Y), and does not repeat the rows of X to match the rows of Y"
library(data.table)
dt1 <-  data.table(ProdId = 1:4,
                   Product = c("Bread", "Cheese", "Pizza", "Butter"))
dt2 <-  data.table(ProdId = c(1, 1, 3, 4, 5),
                   Company = c("A", "B", "C", "D", "E"))
# semi-join
unique(merge(dt1, dt2, by = "ProdId")[, names(dt1), with = FALSE])  # merge() takes by=, not on=
   ProdId Product
1:      1   Bread
2:      3   Pizza
3:      4  Butter
I've simply applied the syntax of inner-join, followed by filtering columns from first table only, with unique() to remove rows of first table which were repeated to match rows of second table.
Edit: The above approach will match dplyr::semi_join() output only if we have unique rows in the first table. If we need to output all the rows including duplicates from first table, then we may use fsetdiff() method shown below.
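To make that caveat concrete, here is a small sketch. The tables `left` and `right` are hypothetical, and the `%in%` filter is a swapped-in alternative for comparison, not the method used in this answer:

```r
library(data.table)

# Hypothetical first table with a duplicated row, and a second table
# that matches ProdId 1 twice.
left  <- data.table(ProdId  = c(1L, 1L, 2L, 3L),
                    Product = c("Bread", "Bread", "Cheese", "Pizza"))
right <- data.table(ProdId  = c(1L, 1L),
                    Company = c("A", "B"))

# merge() + unique(): the two identical Bread rows collapse into one.
collapsed <- unique(merge(left, right, by = "ProdId")[, names(left), with = FALSE])

# Filtering on membership keeps both Bread rows of the first table.
kept <- left[ProdId %in% right$ProdId]
```

Here `collapsed` has one row and `kept` has two, which is the difference the edit describes.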
Another one line data.table solution:
fsetdiff(dt1, dt1[!dt2, on="ProdId"])
   ProdId Product
1:      1   Bread
2:      3   Pizza
3:      4  Butter
I've just removed from first table the anti-join of first and second. Seems simpler to me. If the first table has duplicate rows, we will need:
fsetdiff(dt1, dt1[!dt2, on="ProdId"], all=T)
The fsetdiff() result with ,all=T matches the output from dplyr:
dplyr::semi_join(dt1, dt2, by="ProdId")
  ProdId Product
1      1   Bread
2      3   Pizza
3      4  Butter
Using another set of data taken from one of previous posts:
x <- data.table(x = c(1,1,1,2), y = c("a", "a", "a", "b"))
y <- data.table(x = c(1, 1), z = 10:11)
With dplyr:
dplyr::semi_join(x, y, by="x")
  x y
1 1 a
2 1 a
3 1 a
With data.table:
fsetdiff(x, x[!y, on="x"], all=T)
   x y
1: 1 a
2: 1 a
3: 1 a
Without ,all=T, the duplicate rows are removed:
fsetdiff(x, x[!y, on="x"])
   x y
1: 1 a
I'm confused by all the not-joins above; isn't what you want simply:
unique(x[y, .SD])
#   x y
#1: 1 a
If x can have duplicate keys, then you can unique y instead:
## Create an example data.table 'x' whose first row is repeated three times
x <- data.table(x = c(1,1,1,2), y = c("a", "a", "a", "b"))
setkey(x, x)
y <- data.table(x = c(1, 1), z = 10:11)
setkey(y, x)
x[eval(unique(y, by = key(y))), .SD] # data.table >= 1.9.8 requires by=key(y)
#    x y
# 1: 1 a
# 2: 1 a
# 3: 1 a
The package dplyr supports the following four join types:
inner_join, left_join, semi_join, anti_join
So for the semi-join try the following code
library("data.table")  # needed for data.table() below
library("dplyr")
table1 <- data.table(x = 1:2, y = c("a", "b"))
table2 <- data.table(x = c(1, 1), z = 10:11)
semi_join(table1, table2)
The output is as expected:
# Joining by: "x"
# Source: local data table [1 x 2]
# 
#       x     y
#   (int) (chr)
# 1     1     a
I tried to write a method that doesn't use any names, which are downright confusing in the OP's example.
sJ <- function(x,y){
    ycols <- 1:min(ncol(y),length(key(x)))
    yjoin <- unique(y[, ..ycols])
    yjoin
}
x[eval(sJ(x,y))]
For Victor's simpler example, this gives the desired output:
   x y
1: 1 a
2: 3 c
3: 5 e
This is ~30% slower than Victor's way.
EDIT: And Victor's approach, taking unique before joining, is quite a bit faster:
N <- 1e5*26
x <- data.table(x = 1:N, y = letters, z = rnorm(N))
setkey(x, x)
y <- data.table(x = sample(N, N/10, replace = TRUE),  z = sample(letters, N/10, replace = TRUE))
setkey(y, x)
require(microbenchmark)
microbenchmark(
    sJ=x[eval(sJ(x,y))],
    dolla=unique(x[eval(y$x)]),
    brack=x[eval(unique(y[['x']]))]
)
# Unit: milliseconds
#   expr       min        lq    median        uq      max neval
#     sJ 120.22700 125.04900 126.50704 132.35326 217.6566   100
#  dolla 105.05373 108.33804 109.16249 118.17613 285.9814   100
#  brack  53.95656  61.32669  61.88227  65.21571 235.8048   100
I'm guessing the [[ vs $ doesn't help the speed, but didn't check.