if I understand correctly, duplicated()
function for data.table
returns a logical vector which doesn\'t contain first occurrence of duplicated reco
A third approach, (that appears more efficient for this small example)
You can explicitly call duplicated.data.frame
....
myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
.SDcols = key(myDT)]
microbenchmark(
key=myDT[, fD := .N > 1, by = key(myDT)],
unique=myDT[unique(myDT),fD:=.N>1],
dup = myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
.SDcols = key(myDT)])
## Unit: microseconds
## expr min lq median uq max neval
## key 556.608 575.9265 588.906 600.9795 27713.242 100
## unique 1112.913 1164.8310 1183.244 1216.9000 2263.557 100
## dup 420.173 436.3220 448.396 461.3750 699.986 100
If we expand the size of the sample data.table
, then the key
approach is the clear winner
myDT <- data.table(id = sample(1e6),
fB = sample(seq_len(1e3), size= 1e6, replace=TRUE),
fC = sample(seq_len(1e3), size= 1e6,replace=TRUE ))
setkeyv(myDT, c('fB', 'fC'))
microbenchmark(
key=myDT[, fD := .N > 1, by = key(myDT)],
unique=myDT[unique(myDT),fD:=.N>1],
dup = myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
.SDcols = key(myDT)],times=10)
## Unit: milliseconds
## expr min lq median uq max neval
## key 355.9258 358.1764 360.7628 450.9218 500.8360 10
## unique 451.3794 458.0258 483.3655 519.3341 553.2515 10
## dup 1690.1579 1721.5784 1775.5948 1826.0298 1845.4012 10