Within ID, check for matches/differences


自闭症患者 2021-01-12 00:11

I have a large dataset of over 1.5 million rows from 600k unique subjects, so a number of subjects have multiple rows. I am trying to find the cases where one of the subj

4 Answers

暗喜 (original poster)

2021-01-12 00:30

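Using base functions, the fastest solution would be something like the following: reduce the data to distinct ID/DOB pairs, after which any ID that still appears more than once has conflicting DOBs: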

> x <- unique(test[c("ID","DOB")])
> x$ID[duplicated(x$ID)]
[1] 2
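
To make the two steps concrete, here is a minimal sketch on a small made-up data frame. The contents of `test` are an assumption for illustration only (the question's real data is not shown), and `conflicting_ids` is a hypothetical name; the column names `ID` and `DOB` follow the code above.

```r
# Hypothetical sample data: ID 2 is recorded with two different DOB values.
test <- data.frame(
  ID  = c(1, 1, 2, 2, 3),
  DOB = c("1980-01-01", "1980-01-01", "1975-05-05", "1975-06-05", "1990-09-09"),
  stringsAsFactors = FALSE
)

# Step 1: keep one row per distinct (ID, DOB) pair.
x <- unique(test[c("ID", "DOB")])

# Step 2: any ID still appearing more than once has conflicting DOBs.
# The outer unique() collapses IDs that have three or more distinct DOBs.
conflicting_ids <- unique(x$ID[duplicated(x$ID)])
print(conflicting_ids)

# Optionally pull every original row for the affected subjects for inspection.
print(test[test$ID %in% conflicting_ids, ])
```

Subsetting with `%in%` at the end is just one way to inspect the affected subjects; the detection itself is only the two lines from the answer.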

Timing (1000 replications of each approach):

n <- 1000
system.time(replicate(n,{
  x <- unique(test[c("ID","DOB")])
  x$ID[duplicated(x$ID)]
}))
   user  system elapsed 
   0.70    0.00    0.71 

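# DOBError() is not defined in this answer; it is assumed to come from
# another answer in the thread and is called here as originally posted.
system.time(replicate(n,{
  DOBError(data)
}))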
   user  system elapsed 
   1.69    0.00    1.69 

system.time(replicate(n,{
  zzz <- aggregate(DOB ~ ID, data = test, FUN = function(x) length(unique(x)))
  zzz[zzz$DOB > 1 ,]
}))
   user  system elapsed 
   4.23    0.02    4.27 

# ddply() and summarise() come from the plyr package: library(plyr)
system.time(replicate(n,{
  zz <- ddply(test, "ID", summarise, dups = length(unique(DOB)))
  zz[zz$dups > 1 ,]
}))
   user  system elapsed 
   6.63    0.01    6.64
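
The timings above repeat each approach on the small `test` data. As a rough sketch of how the base approach could be tried at the scale described in the question, the code below simulates data of a similar size. Everything about the simulated data is an assumption (the names `big`, `bad_rows`, `bad_ids`, the date range, and the 100 injected errors are all hypothetical), and no timing result is claimed here; reduce `n_rows` and `n_subjects` if it runs too slowly.

```r
set.seed(1)

# Sizes taken from the question; the data itself is simulated.
n_subjects <- 600000
n_rows     <- 1500000

# Each subject gets one "true" DOB; rows are drawn with replacement,
# so many subjects appear on several rows.
ids       <- sample(n_subjects, n_rows, replace = TRUE)
dob_by_id <- as.Date("1950-01-01") + sample(0:18000, n_subjects, replace = TRUE)
big       <- data.frame(ID = ids, DOB = dob_by_id[ids])

# Inject a handful of inconsistent DOBs by shifting them one day.
bad_rows <- sample(n_rows, 100)
big$DOB[bad_rows] <- big$DOB[bad_rows] + 1

# Time a single run of the unique()/duplicated() approach.
timing <- system.time({
  x       <- unique(big[c("ID", "DOB")])
  bad_ids <- unique(x$ID[duplicated(x$ID)])
})
print(timing)
print(length(bad_ids))
```

A subject whose only row was altered will not be flagged, since a conflict needs at least two rows with differing DOBs, so `length(bad_ids)` can be smaller than the number of injected errors.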
