I have a data frame containing an ID, a start date and an end date. My data is ordered by ID, start, end (in this sequence).
Now I want all rows with the same ID hav
Marcel, I thought I'd just try to improve your code a little. The version below is about 30x faster (from 3 seconds to 0.1 seconds)... The trick is to first extract the three columns to integer and double vectors.
As a side note, I try to use [[
where applicable, and try to keep integers as integers by writing j <- j + 1L
etc. That does not make any difference here, but sometimes coercing between integers and doubles can take quite some time.
smoothingEpisodes3 <- function (theData) {
theLength <- nrow(theData)
if (theLength < 2L) return(theData)
id <- as.integer(theData[["ID"]])
start <- as.numeric(theData[["START"]])
end <- as.numeric(theData[["END"]])
curId <- id[[1L]]
curStart <- start[[1L]]
curEnd <- end[[1L]]
out.1 <- integer(length = theLength)
out.2 <- out.3 <- numeric(length = theLength)
j <- 1L
for(i in 2:nrow(theData)) {
nextId <- id[[i]]
nextStart <- start[[i]]
nextEnd <- end[[i]]
if (curId != nextId | (curEnd + 1) < nextStart) {
out.1[[j]] <- curId
out.2[[j]] <- curStart
out.3[[j]] <- curEnd
j <- j + 1L
curId <- nextId
curStart <- nextStart
curEnd <- nextEnd
} else {
curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
}
}
out.1[[j]] <- curId
out.2[[j]] <- curStart
out.3[[j]] <- curEnd
theOutput <- data.frame(ID = out.1[1:j], START = as.Date(out.2[1:j], origin = "1970-01-01"), END = as.Date(out.3[1:j], origin = "1970-01-01"))
theOutput
}
Then, the following code will show the speed difference. I just took your data and replicated it 1000 times...
x <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), START = structure(c(10957,
11048, 11062, 11201, 10971, 10988, 11048, 11109, 11139), class = "Date"),
END = structure(c(11047, 11108, 11169, 11261, 11047, 11031,
11062, 11123, 11153), class = "Date")), .Names = c("ID",
"START", "END"), class = "data.frame", row.names = c(NA, 9L))
r <- 1000
y <- data.frame(ID=rep(x$ID, r) + rep(1:r, each=nrow(x))-1, START=rep(x$START, r), END=rep(x$END, r))
system.time( a1 <- smoothingEpisodes(y) ) # 2.95 seconds
system.time( a2 <- smoothingEpisodes3(y) ) # 0.10 seconds
all.equal( a1, a2 )