find all duplicated records in data.table (not all-but-one)

Happy的楠姐 2020-12-15 03:27

If I understand correctly, the duplicated() function for data.table returns a logical vector which doesn't contain the first occurrence of a duplicated reco

4 answers

野趣味 (original poster)

2020-12-15 03:55

A third approach (which appears more efficient for this small example):

You can explicitly call duplicated.data.frame, once forwards and once with fromLast=TRUE, and combine the two results:

myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
  .SDcols = key(myDT)]


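library(data.table)
library(microbenchmark)
# myDT: the small keyed example data.table (not shown in this answer)
# Note: since data.table 1.9.8, unique() compares all columns by default;
# use unique(myDT, by = key(myDT)) to deduplicate on the key only.
microbenchmark(
  key    = myDT[, fD := .N > 1, by = key(myDT)],
  unique = myDT[unique(myDT), fD := .N > 1],
  dup    = myDT[, fD := duplicated.data.frame(.SD) | duplicated.data.frame(.SD, fromLast = TRUE),
                .SDcols = key(myDT)])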
## Unit: microseconds
##    expr      min        lq   median        uq       max neval
##     key  556.608  575.9265  588.906  600.9795 27713.242   100
##  unique 1112.913 1164.8310 1183.244 1216.9000  2263.557   100
##     dup  420.173  436.3220  448.396  461.3750   699.986   100

If we expand the sample data.table to one million rows, the key approach is the clear winner:

myDT <- data.table(id = sample(1e6),
                   fB = sample(seq_len(1e3), size = 1e6, replace = TRUE),
                   fC = sample(seq_len(1e3), size = 1e6, replace = TRUE))
setkeyv(myDT, c('fB', 'fC'))

microbenchmark(
  key    = myDT[, fD := .N > 1, by = key(myDT)],
  unique = myDT[unique(myDT), fD := .N > 1],
  dup    = myDT[, fD := duplicated.data.frame(.SD) | duplicated.data.frame(.SD, fromLast = TRUE),
                .SDcols = key(myDT)],
  times = 10)
## Unit: milliseconds
##    expr       min        lq    median        uq       max neval
##     key  355.9258  358.1764  360.7628  450.9218  500.8360    10
##  unique  451.3794  458.0258  483.3655  519.3341  553.2515    10
##     dup 1690.1579 1721.5784 1775.5948 1826.0298 1845.4012    10
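
As a sanity check that the grouped `.N > 1` approach and the forward-plus-backward `duplicated()` approach flag the same rows, here is a minimal sketch on a toy table. The table `toyDT` and the column `fD_grp` are made up for illustration and are not part of the original example; the sketch also uses data.table's own `duplicated()` method with `by =` rather than calling `duplicated.data.frame` directly, which assumes a data.table version that supports `fromLast` in that method.

```r
library(data.table)

# Hypothetical toy data: ids 1, 2 and 4 share the same (fB, fC) pair.
toyDT <- data.table(id = 1:5,
                    fB = c(1, 1, 2, 1, 3),
                    fC = c("a", "a", "b", "a", "c"))
setkeyv(toyDT, c("fB", "fC"))   # sorts rows so that id order becomes 1, 2, 4, 3, 5

# Forward scan: flags every repeat except the first occurrence.
fwd <- duplicated(toyDT, by = key(toyDT))
# Backward scan: flags every repeat except the last occurrence.
bwd <- duplicated(toyDT, by = key(toyDT), fromLast = TRUE)

# A row belongs to a duplicated group if either scan flags it.
toyDT[, fD := fwd | bwd]

# Grouped alternative: a key group with more than one row is duplicated.
toyDT[, fD_grp := .N > 1, by = key(toyDT)]

# All duplicated records, including the first occurrence of each.
dups <- toyDT[fD == TRUE]
print(dups)
```

Both flag columns should agree row for row, and `dups` should contain the three rows with ids 1, 2 and 4.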
