I have a large dataset, over 1.5 million rows, from 600k unique subjects, so a number of subjects have multiple rows. I am trying to find the cases where one of the subjects has more than one distinct DOB recorded.
With such a large volume of data I propose a different solution, based on comparing adjacent rows, which uses the power of vectorised operations in R:
# sort so that rows with the same ID are adjacent
test <- test[order(test$ID), ]
n <- nrow(test)
# TRUE where a row shares the previous row's ID but has a different DOB
ind <- test$ID[-1] == test$ID[-n] & test$DOB[-1] != test$DOB[-n]
# prepend FALSE to realign with the full vector, then pull out the IDs
unique(test$ID[c(FALSE, ind)])
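To see what the comparison does, here is a tiny, made-up test set (the original test data isn't shown in this answer); ID 2 is recorded with two different DOBs and is the only ID the snippet should return:

test <- data.frame(
  ID  = c(1, 1, 2, 2, 3),
  DOB = c("2000-01-01", "2000-01-01", "2000-01-01", "2000-01-02", "2000-01-01"),
  stringsAsFactors = FALSE
)
# running the snippet above on this data gives:
# [1] 2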
For the small test data the timing is similar to Joris's approach, but for large data the difference is substantial:
# simulate the full-size problem: 1.8M rows over 600k IDs (each ID
# three times), then give 5000 randomly chosen rows a conflicting DOB
test2 <- data.frame(
  ID = rep(1:600000, 3),
  DOB = "2000-01-01",
  stringsAsFactors = FALSE
)
test2$DOB[sample.int(nrow(test2), 5000)] <- "2000-01-02"
# approach A (Joris's): drop duplicated (ID, DOB) pairs; any ID still
# appearing more than once has more than one DOB
system.time(resA <- {
  x <- unique(test2[c("ID", "DOB")])
  x$ID[duplicated(x$ID)]
})
# user system elapsed
# 7.44 0.14 7.58
# approach B: vectorised comparison of adjacent rows after sorting
system.time(resB <- {
  test2 <- test2[order(test2$ID), ]
  n <- nrow(test2)
  ind <- test2$ID[-1] == test2$ID[-n] & test2$DOB[-1] != test2$DOB[-n]
  unique(test2$ID[c(FALSE, ind)])
})
})
# user system elapsed
# 0.76 0.04 0.81
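The gap most likely comes from calling unique() on a data frame: duplicated.data.frame() pastes every row into a string before checking for duplicates, which is expensive at 1.8 million rows, whereas approach B only does vectorised comparisons on atomic columns. Both approaches find the same IDs: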
all.equal(sort(resA), sort(resB))
# [1] TRUE
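If you need this check more than once, the same idea generalises to any ID/value pair. A minimal sketch; the function name ids_with_conflicts and its arguments are my own invention, not part of the code above:

# hypothetical helper: returns IDs whose rows disagree on a value column
ids_with_conflicts <- function(df, id = "ID", value = "DOB") {
  df <- df[order(df[[id]]), ]   # group equal IDs together
  n <- nrow(df)
  ind <- df[[id]][-1] == df[[id]][-n] & df[[value]][-1] != df[[value]][-n]
  unique(df[[id]][c(FALSE, ind)])
}

ids_with_conflicts(test2)  # same IDs as resB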