find *all* duplicated records in data.table (not all-but-one)

Happy的楠姐 2020-12-15 03:27

If I understand correctly, the duplicated() function for data.table returns a logical vector which doesn't contain the first occurrence of a duplicated reco
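
To make the behaviour in question concrete, here is a minimal sketch on a made-up table (the name `exampleDT` and its values are hypothetical, not from the original post). It only shows that the first row of each duplicated group is reported as `FALSE`.

```r
library(data.table)

# Hypothetical toy data: rows 1-2 are identical, rows 4-6 are identical
exampleDT <- data.table(fB = c(1, 1, 2, 3, 3, 3),
                        fC = c("a", "a", "b", "c", "c", "c"))

# duplicated() marks only the second and later occurrences in each group
dup_default <- duplicated(exampleDT)
print(dup_default)
# FALSE  TRUE FALSE FALSE  TRUE  TRUE
```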

4 answers
  •  野趣味 (original poster)
    2020-12-15 03:55

    A third approach, which appears more efficient for this small example, is to call duplicated.data.frame explicitly, once from each direction:

    myDT[, fD := duplicated.data.frame(.SD) | duplicated.data.frame(.SD, fromLast = TRUE),
         .SDcols = key(myDT)]
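
    As a small worked illustration of why the two calls are combined, the sketch below runs the same expression on a made-up keyed table (`toyDT` and its values are hypothetical). The forward pass misses the first row of each group, the `fromLast = TRUE` pass misses the last, and their union flags every row that has a twin on the key columns.

    ```r
    library(data.table)

    # Hypothetical toy table, keyed on the columns that define a duplicate
    toyDT <- data.table(id = 1:6,
                        fB = c(1, 1, 2, 3, 3, 3),
                        fC = c(5, 5, 6, 7, 7, 8))
    setkeyv(toyDT, c("fB", "fC"))

    # Forward pass OR backward pass: TRUE for every member of a duplicated group
    toyDT[, fD := duplicated.data.frame(.SD) | duplicated.data.frame(.SD, fromLast = TRUE),
          .SDcols = key(toyDT)]

    print(toyDT$fD)
    # TRUE  TRUE FALSE  TRUE  TRUE FALSE
    ```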
    
    
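    library(data.table)
    library(microbenchmark)

    # Note: written against an older data.table; newer releases changed the
    # defaults of unique() and of joins without by, so `unique` may need adapting.
    microbenchmark(
      key    = myDT[, fD := .N > 1, by = key(myDT)],
      unique = myDT[unique(myDT), fD := .N > 1],
      dup    = myDT[, fD := duplicated.data.frame(.SD) | duplicated.data.frame(.SD, fromLast = TRUE),
                    .SDcols = key(myDT)])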
    ## Unit: microseconds
    ##    expr      min        lq   median        uq       max neval
    ##     key  556.608  575.9265  588.906  600.9795 27713.242   100
    ##  unique 1112.913 1164.8310 1183.244 1216.9000  2263.557   100
    ##     dup  420.173  436.3220  448.396  461.3750   699.986   100
    

    If we expand the sample data.table to one million rows, the key approach is the clear winner (median of roughly 361 ms, against 483 ms for unique and 1776 ms for dup):

    myDT <- data.table(id = sample(1e6),
                       fB = sample(seq_len(1e3), size = 1e6, replace = TRUE),
                       fC = sample(seq_len(1e3), size = 1e6, replace = TRUE))
    setkeyv(myDT, c('fB', 'fC'))

    microbenchmark(
      key    = myDT[, fD := .N > 1, by = key(myDT)],
      unique = myDT[unique(myDT), fD := .N > 1],
      dup    = myDT[, fD := duplicated.data.frame(.SD) | duplicated.data.frame(.SD, fromLast = TRUE),
                    .SDcols = key(myDT)],
      times = 10)
    ## Unit: milliseconds
    ##    expr       min        lq    median        uq       max neval
    ##     key  355.9258  358.1764  360.7628  450.9218  500.8360    10
    ##  unique  451.3794  458.0258  483.3655  519.3341  553.2515    10
    ##     dup 1690.1579 1721.5784 1775.5948 1826.0298 1845.4012    10
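
    For readers who want to see what the `key` variant computes, here is a minimal sketch on a made-up table (`flagDT`, `dupRows` and the values are hypothetical). As an assumption for illustration it passes the grouping columns directly to `by` instead of setting a key; the idea is the same: count the rows in each group and flag groups with more than one.

    ```r
    library(data.table)

    # Hypothetical toy table; a duplicate is defined by the pair (fB, fC)
    flagDT <- data.table(id = 1:6,
                         fB = c(1, 1, 2, 3, 3, 3),
                         fC = c(5, 5, 6, 7, 7, 8))

    # .N is the group size, so .N > 1 is TRUE for every row of a repeated pair
    flagDT[, fD := .N > 1, by = c("fB", "fC")]

    # Keep all duplicated records, including the first occurrence of each
    dupRows <- flagDT[fD == TRUE]
    print(dupRows$id)
    # 1 2 4 5
    ```

    Grouping touches each row once, which is one plausible reason it scales better than building two row-wise duplicated vectors, though the timings above are the only evidence given here.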
    
