Within ID, check for matches/differences

自闭症患者 2021-01-12 00:11

I have a large dataset: over 1.5 million rows from 600k unique subjects, so a number of subjects have multiple rows. I am trying to find the cases where one of the subj

4 answers
  •  暗喜 (original poster)
    2021-01-12 00:30

    Using base functions, the fastest solution would be something like:

    > x <- unique(test[c("ID","DOB")])
    > x$ID[duplicated(x$ID)]
    [1] 2
    

    Timing:

    n <- 1000
    system.time(replicate(n,{
      x <- unique(test[c("ID","DOB")])
      x$ID[duplicated(x$ID)]
     }))
       user  system elapsed 
       0.70    0.00    0.71 
    
    system.time(replicate(n,{
      DOBError(data)
    }))
       user  system elapsed 
       1.69    0.00    1.69 
    
    system.time(replicate(n,{
      zzz <- aggregate(DOB ~ ID, data = test, FUN = function(x) length(unique(x)))
      zzz[zzz$DOB > 1 ,]
    }))
       user  system elapsed 
       4.23    0.02    4.27 
    
    library(plyr)  # ddply() and summarise() come from the plyr package
    system.time(replicate(n,{
       zz <- ddply(test, "ID", summarise, dups = length(unique(DOB)))
       zz[zz$dups > 1 ,]
    }))
       user  system elapsed 
       6.63    0.01    6.64 
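
To see why the `unique()` plus `duplicated()` idiom above flags subjects with conflicting values, here is a minimal sketch on a toy data frame. The column names `ID` and `DOB` follow the answer; the values, and the names `bad_ids` and `conflicts`, are made up for illustration and are not from the original thread.

```r
# Toy data (hypothetical values): subject 2 has two different DOBs.
test <- data.frame(
  ID  = c(1, 1, 2, 2, 3),
  DOB = c("1980-01-01", "1980-01-01", "1975-05-05", "1975-06-05", "1990-09-09"),
  stringsAsFactors = FALSE
)

# Step 1: collapse to distinct (ID, DOB) pairs.
# A subject whose rows all agree is left with exactly one row.
x <- unique(test[c("ID", "DOB")])

# Step 2: an ID that still appears more than once has at least two
# different DOBs. The outer unique() keeps each such ID only once,
# even if it has three or more distinct DOBs.
bad_ids <- unique(x$ID[duplicated(x$ID)])

# Step 3: pull every original row for the flagged subjects for inspection.
conflicts <- test[test$ID %in% bad_ids, ]

print(bad_ids)
print(conflicts)
```

On this toy input only subject 2 is flagged. The outer `unique()` in step 2 is an addition to the answer's one-liner; without it, an ID with three distinct DOBs would be reported twice.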
    
