Perform a semi-join with data.table

天命终不由人 2020-11-27 16:44

How do I perform a semi-join with data.table? A semi-join is like an inner join except that it only returns the columns of X (not also those of Y), and does not repeat the r

8 answers
  •  时光说笑
    2020-11-27 17:28

    One solution I can think of is:

    tmp <- x[!y]
    x[!tmp]
    

    In data.table, you can pass another data table as the i expression (i.e., the first argument in the `[.data.table` call), and that will perform a join, e.g.:

    x <- data.table(x = 1:10, y = letters[1:10])
    setkey(x, x)
    y <- data.table(x = c(1,3,5,1), z = 1:4)
    
    > x[y]
       x y z
    1: 1 a 1
    2: 3 c 2
    3: 5 e 3
    4: 1 a 4
    

    The ! before the i expression is an extension of the syntax above that performs a 'not-join', as described on p. 11 of the data.table documentation. So the first assignment evaluates to the subset of x whose key (column x) does not appear in y:

    > x[!y]
        x y
    1:  2 b
    2:  4 d
    3:  6 f
    4:  7 g
    5:  8 h
    6:  9 i
    7: 10 j
    

    In this respect it behaves like setdiff. The second statement, x[!tmp], then drops those non-matching rows from x, which leaves exactly the rows of x whose key is present in y.
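
    To make the two steps concrete, here is a minimal sketch that reuses the toy tables from above. The names anti and semi are only illustrative (the answer itself calls the intermediate table tmp), and y's key column is written with integer literals here so that both tables share the same type.

```r
library(data.table)

# Same toy tables as in the answer; x is keyed on column x.
x <- data.table(x = 1:10, y = letters[1:10])
setkey(x, x)
y <- data.table(x = c(1L, 3L, 5L, 1L), z = 1:4)

# Step 1: not-join, i.e. the rows of x whose key is absent from y.
anti <- x[!y]

# Step 2: not-join against step 1, i.e. the rows of x whose key is present in y.
semi <- x[!anti]

# The keys left by step 1 agree with a plain setdiff on the key column.
identical(anti$x, setdiff(x$x, y$x))

# Only x's columns survive, and key 1 appears once although y lists it twice.
print(semi)
```

    Unlike x[y], the result carries no z column and does not repeat the row for key 1, which is what distinguishes a semi-join from the inner join shown earlier.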

    The ! feature was added in data.table 1.8.4 with the following note in NEWS:

    o   A new "!" prefix on i signals 'not-join' (a.k.a. 'not-where'), #1384i.
            DT[-DT["a", which=TRUE, nomatch=0]]   # old not-join idiom, still works
            DT[!"a"]                              # same result, now preferred.
            DT[!J(6),...]                         # !J == not-join
            DT[!2:3,...]                          # ! on all types of i
            DT[colA!=6L | colB!=23L,...]          # multiple vector scanning approach (slow)
            DT[!J(6L,23L)]                        # same result, faster binary search
        '!' has been used rather than '-' :
            * to match the 'not-join'/'not-where' nomenclature
            * with '-', DT[-0] would return DT rather than DT[0] and not be backwards
              compatible. With '!', DT[!0] returns DT both before (since !0 is TRUE in
              base R) and after this new feature.
            * to leave DT[+J...] and DT[-J...] available for future use
    

    For some reason, the nested form x[!(x[!y])] doesn't work, probably because data.table is too smart about parsing the argument.

    P.S. As Josh O'Brien pointed out in another answer, a one-liner would be x[!eval(x[!y])].
