I have a large dataset, over 1.5 million rows, from 600k unique subjects, so a number of subjects have multiple rows. I am trying to find the cases where one of the subj
Using base functions, the fastest solution would be something like:
> x <- unique(test[c("ID","DOB")])
> x$ID[duplicated(x$ID)]
[1] 2
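To see why this works, here is a minimal self-contained sketch. The toy data frame below is made up for illustration (the real `test` data is not shown here); only the column names `ID` and `DOB` are taken from the code above. `unique()` first collapses the data to distinct (ID, DOB) pairs, so any ID that still shows up more than once must have at least two different DOBs.

```r
# Hypothetical toy data (an assumption, not the original dataset):
# ID 2 is recorded with two different dates of birth.
test <- data.frame(
  ID  = c(1, 1, 2, 2, 3),
  DOB = as.Date(c("1980-01-01", "1980-01-01",
                  "1975-05-05", "1975-06-05",
                  "1990-09-09"))
)

# Step 1: keep one row per distinct (ID, DOB) combination.
x <- unique(test[c("ID", "DOB")])

# Step 2: an ID that is still repeated has conflicting DOBs.
# The outer unique() lists each such ID once, even if it has
# three or more different DOBs.
conflicting_ids <- unique(x$ID[duplicated(x$ID)])
print(conflicting_ids)
```

Without the outer `unique()`, an ID with three distinct DOBs would be returned twice, which may or may not matter depending on what you do with the result.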
Timing:
n <- 1000
system.time(replicate(n,{
x <- unique(test[c("ID","DOB")])
x$ID[duplicated(x$ID)]
}))
user system elapsed
0.70 0.00 0.71
system.time(replicate(n,{
DOBError(data)
}))
user system elapsed
1.69 0.00 1.69
system.time(replicate(n,{
zzz <- aggregate(DOB ~ ID, data = test, FUN = function(x) length(unique(x)))
zzz[zzz$DOB > 1 ,]
}))
user system elapsed
4.23 0.02 4.27
library(plyr)  # provides ddply() and summarise()
system.time(replicate(n,{
zz <- ddply(test, "ID", summarise, dups = length(unique(DOB)))
zz[zz$dups > 1 ,]
}))
user system elapsed
6.63 0.01 6.64