Perform a semi-join with data.table

天命终不由人 2020-11-27 16:44

How do I perform a semi-join with data.table? A semi-join is like an inner join except that it only returns the columns of X (not also those of Y), and does not repeat the r

8 answers
  •  刺人心
    刺人心 (original poster)
    2020-11-27 17:40

    I tried to write a method that doesn't use any names, which are downright confusing in the OP's example.

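    library(data.table)

    # Build the table to join on: the first k columns of y, where k is the
    # number of key columns of x, with duplicate rows removed.
    sJ <- function(x, y) {
        ycols <- 1:min(ncol(y), length(key(x)))
        yjoin <- unique(y[, ..ycols])
        yjoin
    }

    # Join x against the de-duplicated key columns of y.
    x[eval(sJ(x, y))]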
    

    For Victor's simpler example, this gives the desired output:

       x y
    1: 1 a
    2: 3 c
    3: 5 e
    

    This is ~30% slower than Victor's way.
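
    To make the helper easier to follow, here is a minimal sketch on made-up data. The tables `X` and `Y`, their contents, and the `nomatch = 0L` argument are assumptions added for illustration and are not part of the original answer. Upper-case table names are used so they cannot be confused with the column names `x` and `y`, which is why the `eval()` wrapper is left out here.

    ```r
    library(data.table)

    # Hypothetical toy data (not the thread's original tables).
    X <- data.table(x = 1:5, y = letters[1:5], key = "x")
    Y <- data.table(x = c(1L, 3L, 5L, 5L, 3L), z = c("p", "q", "r", "s", "t"))

    # Same helper as in the answer: keep as many leading columns of y as x
    # has key columns, and drop duplicate rows so each match appears once.
    sJ <- function(x, y) {
        ycols <- 1:min(ncol(y), length(key(x)))
        unique(y[, ..ycols])
    }

    # nomatch = 0L (an addition, not in the original) drops rows of Y that
    # have no partner in X instead of returning them filled with NA.
    res <- X[sJ(X, Y), nomatch = 0L]
    print(res)
    ```

    With this data the result should contain only the columns of `X`, and only the rows whose key is 1, 3, or 5, each exactly once even though `Y` repeats 3 and 5.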

    EDIT: And Victor's approach, taking unique before joining, is quite a bit faster:

    library(data.table)
    library(microbenchmark)

    # x: 2.6 million rows keyed on x; y: 260,000 sampled keys, with repeats.
    N <- 1e5*26
    x <- data.table(x = 1:N, y = letters, z = rnorm(N))
    setkey(x, x)
    y <- data.table(x = sample(N, N/10, replace = TRUE),  z = sample(letters, N/10, replace = TRUE))
    setkey(y, x)

    microbenchmark(
        sJ=x[eval(sJ(x,y))],               # helper: unique on a data.table, then join
        dolla=unique(x[eval(y$x)]),        # join first, then unique
        brack=x[eval(unique(y[['x']]))]    # unique on the vector first, then join
    )
    # Unit: milliseconds
    #   expr       min        lq    median        uq      max neval
    #     sJ 120.22700 125.04900 126.50704 132.35326 217.6566   100
    #  dolla 105.05373 108.33804 109.16249 118.17613 285.9814   100
    #  brack  53.95656  61.32669  61.88227  65.21571 235.8048   100
    

    I'm guessing the [[ vs $ difference doesn't contribute to the speedup, but I didn't check.
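
    The point about taking unique before joining can be illustrated on a tiny table. This is only a sketch under assumed data: the tables `X` and `Y`, the column names `id` and `val`, and the `nomatch = 0L` argument are hypothetical and do not come from the thread. It shows that both orderings end up with the same rows, while joining first has to materialise the repeated rows before discarding them.

    ```r
    library(data.table)

    # Hypothetical toy data; Y deliberately repeats some ids.
    X <- data.table(id = 1:5, val = letters[1:5], key = "id")
    Y <- data.table(id = c(1L, 3L, 3L, 5L, 5L))

    # Join first: every duplicate id in Y pulls the matching row of X again.
    joined <- X[Y, nomatch = 0L]
    after <- unique(joined)

    # De-duplicate first: the join only has to look up the distinct ids.
    before <- X[unique(Y), nomatch = 0L]

    print(nrow(joined))  # rows produced before de-duplication
    print(nrow(before))  # rows produced when unique() runs first
    ```

    On real data the size of the difference depends on how many keys in `Y` are repeated, so the timings in the benchmark should not be read as a general ratio.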
