I have a large dataset, over 1.5 million rows, from 600k unique subjects, so a number of subjects have multiple rows. I am trying to find the cases where one of the subjects has more than one distinct DOB recorded.
With such a large volume of data I propose a different solution, based on comparing adjacent rows, which uses the power of vectorised operations in R:
# sort so that rows with the same ID are adjacent
test <- test[order(test$ID), ]
n <- nrow(test)
# TRUE where a row shares the previous row's ID but has a different DOB
ind <- test$ID[-1] == test$ID[-n] & test$DOB[-1] != test$DOB[-n]
# prepend FALSE to realign with the full vector, then pull out the IDs
unique(test$ID[c(FALSE, ind)])
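To see what the comparison does, here is a tiny, made-up test set (the original test data isn't shown in this answer); ID 2 is recorded with two different DOBs and is the only ID the snippet should return:

test <- data.frame(
  ID  = c(1, 1, 2, 2, 3),
  DOB = c("2000-01-01", "2000-01-01", "2000-01-01", "2000-01-02", "2000-01-01"),
  stringsAsFactors = FALSE
)
# running the snippet above on this data gives:
# [1] 2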
For the small test data the timing is similar to Joris's approach, but for large data the difference is substantial:
# simulate the full-size problem: 1.8M rows over 600k IDs (each ID
# three times), then give 5000 randomly chosen rows a conflicting DOB
test2 <- data.frame(
  ID = rep(1:600000, 3),
  DOB = "2000-01-01",
  stringsAsFactors = FALSE
)
test2$DOB[sample.int(nrow(test2), 5000)] <- "2000-01-02"
# approach A (Joris's): drop duplicated (ID, DOB) pairs; any ID still
# appearing more than once has more than one DOB
system.time(resA <- {
  x <- unique(test2[c("ID", "DOB")])
  x$ID[duplicated(x$ID)]
})
# user system elapsed
# 7.44 0.14 7.58
# approach B: vectorised comparison of adjacent rows after sorting
system.time(resB <- {
  test2 <- test2[order(test2$ID), ]
  n <- nrow(test2)
  ind <- test2$ID[-1] == test2$ID[-n] & test2$DOB[-1] != test2$DOB[-n]
  unique(test2$ID[c(FALSE, ind)])
})
})
# user system elapsed
# 0.76 0.04 0.81
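The gap most likely comes from calling unique() on a data frame: duplicated.data.frame() pastes every row into a string before checking for duplicates, which is expensive at 1.8 million rows, whereas approach B only does vectorised comparisons on atomic columns. Both approaches find the same IDs: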
all.equal(sort(resA), sort(resB))
# [1] TRUE
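If you need this check more than once, the same idea generalises to any ID/value pair. A minimal sketch; the function name ids_with_conflicts and its arguments are my own invention, not part of the code above:

# hypothetical helper: returns IDs whose rows disagree on a value column
ids_with_conflicts <- function(df, id = "ID", value = "DOB") {
  df <- df[order(df[[id]]), ]   # group equal IDs together
  n <- nrow(df)
  ind <- df[[id]][-1] == df[[id]][-n] & df[[value]][-1] != df[[value]][-n]
  unique(df[[id]][c(FALSE, ind)])
}

ids_with_conflicts(test2)  # same IDs as resB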