data.table with two string columns of set elements, extract unique rows with each row unsorted

礼貌的吻别 2020-11-27 07:45

Suppose I have a data.table like this:

Table:

V1 V2
 A  B
 C  D
 C  A
 B  A
 D  C

I want each row to be regarded as a set, which me

3 answers
  •  -上瘾入骨i
    2020-11-27 08:28

    Borrowing (probably unrealistic) data from a duplicate question:

    library(data.table)
    size <- 118000000
    key1 <- sample( LETTERS, size, replace=TRUE, prob=runif(length(LETTERS), 0.0, 5.0) )
    key2 <- sample( LETTERS, size, replace=TRUE, prob=runif(length(LETTERS), 0.0, 5.0) )
    val <- runif(size, 0.0, 5.0)
    
    dt <- data.table(key1, key2, val, stringsAsFactors=FALSE)
    

    Here's a fast way if your data looks like this:

    # eddi's answer
    system.time(res1 <- dt[dt[, .I[1], by=.(pmin(key1, key2), pmax(key1, key2))]$V1])
    #    user  system elapsed 
    #  101.79    3.01  107.98 
    
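    # optimized for this data: drop exact duplicates first (few distinct pairs),
    # then put each pair in sorted order by reference and deduplicate again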
    system.time({
      dt2 <- unique(dt, by=c("key1", "key2"))[key1 > key2, c("key1", "key2") := .(key2, key1)]
      res2 <- unique(dt2, by=c("key1", "key2")) 
    })
    #    user  system elapsed 
    #    8.50    1.16    4.93 
    
    fsetequal(copy(res1)[key1 > key2, c("key1", "key2") := .(key2, key1)], res2)
    # [1] TRUE
    

    Data like this seems unlikely if it pertains to covariances, since each pair should then have at most one duplicate (i.e., A-B together with B-A).
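
    For reference, here is a minimal sketch of both approaches applied to the five-row table from the question. The names tab, tmp, res_a, res_b, lo, hi and idx are made up for this illustration and do not appear in the original code:

    ```r
    library(data.table)

    # The five-row table from the question
    tab <- data.table(V1 = c("A", "C", "C", "B", "D"),
                      V2 = c("B", "D", "A", "A", "C"))

    # Approach 1: group on the order-independent pair (pmin, pmax)
    # and keep the first row index of each group
    res_a <- tab[tab[, .(idx = .I[1]), by = .(lo = pmin(V1, V2), hi = pmax(V1, V2))]$idx]

    # Approach 2: drop exact duplicates, put each pair in sorted order
    # by reference, then drop duplicates again
    tmp <- unique(tab, by = c("V1", "V2"))
    tmp[V1 > V2, c("V1", "V2") := .(V2, V1)]
    res_b <- unique(tmp, by = c("V1", "V2"))

    print(res_a)  # rows as originally written: A-B, C-D, C-A
    print(res_b)  # rows in sorted order: A-B, C-D, A-C
    ```

    Both results keep three rows. The first approach returns each kept row as it was originally written, while the second returns it with V1 and V2 in sorted order, which is why the comparison above normalizes res1 before calling fsetequal.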
