R: Checking for duplicates is painfully slow, even with mclapply


Question


I've got some data involving repeated sales for a bunch of cars with unique Ids. A car can be sold more than once.

Some of the Ids are erroneous, however, so for each Id I'm checking whether the size is recorded as the same across all of its sales. If it isn't, then I know that the Id is erroneous.

I'm trying to do this with the following code:

library("doMC")

Data <- data.frame(ID=c(15432,67325,34623,15432,67325,34623),Size=c("Big","Med","Small","Big","Med","Big"))
compare <- function(v) all(sapply( as.list(v[-1]), FUN=function(z) {isTRUE(all.equal(z, v[1]))}))

IsGoodId = function(Id){
  Sub = Data[Data$ID==Id,]
  if (length(Sub[,1]) > 1){
    return(compare(Sub[,"Size"]))
  }else{
    return(TRUE)
  }
}

WhichAreGood = mclapply(unique(Data$ID),IsGoodId)

But it's painfully, awfully, terribly slow on my quad-core i5.

Can anyone see where the bottleneck is? I'm a newbie to R optimisation.

Thanks, -N


Answer 1:


Looks like your algorithm makes N^2 comparisons: Data$ID == Id scans every row of the data frame once for each unique Id. Maybe something like the following will scale better. First we find the duplicate sales, thinking that these are a small subset of the total.

dups = unique(Data$ID[duplicated(Data$ID)])         # Ids that occur more than once
DupData = Data[Data$ID %in% dups, , drop = FALSE]   # keep only the repeated-sale rows
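
On the toy Data in the question every Id happens to repeat, so dups picks up all three Ids and DupData is the whole frame; on real data both should be much smaller subsets:

dups
# [1] 15432 67325 34623
nrow(DupData)
# [1] 6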

The %in% operator scales very well. Then split the Size column by Id and check for Ids that have more than one distinct size:

tapply(DupData$Size, DupData$ID, function(x) length(unique(x)) != 1)

This gives a named logical vector, with TRUE indicating that there is more than one size per Id. This scales approximately linearly with the number of duplicate sales; there are clever ways to make it go faster if your duplicated data is itself big.
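Run on the sample Data, the check should flag only the Id whose size changes between sales (34623, recorded as both Small and Big):

tapply(DupData$Size, DupData$ID, function(x) length(unique(x)) != 1)
# 15432 34623 67325
# FALSE  TRUE FALSE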

Hmm, thinking about this a bit more, I guess

u = unique(Data)          # drop fully duplicated rows (same Id and Size)
u$ID[duplicated(u$ID)]    # Ids that still appear, i.e. recorded with more than one Size

does the trick.
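
With the sample Data this shorter version returns just the erroneous Id:

u$ID[duplicated(u$ID)]
# [1] 34623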



Source: https://stackoverflow.com/questions/15690688/r-checking-for-duplicates-is-painfully-slow-even-with-mclapply
