“smoothing” time data - can it be done more efficient?

后端 未结 3 766
感情败类
感情败类 2020-12-22 11:06

I have a data frame containing an ID, a start date and an end date. My data is ordered by ID, start, end (in this sequence).

Now I want all rows with the same ID hav

3条回答
  •  情话喂你
    2020-12-22 11:36

    Marcel, I thought I'd just try to improve your code a little. The version below is about 30x faster (from 3 seconds to 0.1 seconds)... The trick is to first extract the three columns to integer and double vectors.

    As a side note, I try to use [[ where applicable, and try to keep integers as integers by writing j <- j + 1L etc. That does not make any difference here, but sometimes coercing between integers and doubles can take quite some time.

    smoothingEpisodes3 <- function (theData) {
        theLength <- nrow(theData)
        if (theLength < 2L) return(theData)
    
        id <- as.integer(theData[["ID"]])
        start <- as.numeric(theData[["START"]])
        end <- as.numeric(theData[["END"]])
    
        curId <- id[[1L]]
        curStart <- start[[1L]]
        curEnd <- end[[1L]]
    
        out.1 <- integer(length = theLength)
        out.2 <- out.3 <- numeric(length = theLength)
    
        j <- 1L
    
        for(i in 2:nrow(theData)) {
            nextId <- id[[i]]
            nextStart <- start[[i]]
            nextEnd <- end[[i]]
    
            if (curId != nextId | (curEnd + 1) < nextStart) {
                out.1[[j]] <- curId
                out.2[[j]] <- curStart
                out.3[[j]] <- curEnd
    
                j <- j + 1L
    
                curId <- nextId
                curStart <- nextStart
                curEnd <- nextEnd
            } else {
                curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
            }
        }
    
        out.1[[j]] <- curId
        out.2[[j]] <- curStart
        out.3[[j]] <- curEnd
    
        theOutput <- data.frame(ID = out.1[1:j], START = as.Date(out.2[1:j], origin = "1970-01-01"), END = as.Date(out.3[1:j], origin = "1970-01-01"))
    
        theOutput
    }
    

    Then, the following code will show the speed difference. I just took your data and replicated it 1000 times...

    x <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), START = structure(c(10957, 
    11048, 11062, 11201, 10971, 10988, 11048, 11109, 11139), class = "Date"), 
        END = structure(c(11047, 11108, 11169, 11261, 11047, 11031, 
        11062, 11123, 11153), class = "Date")), .Names = c("ID", 
    "START", "END"), class = "data.frame", row.names = c(NA, 9L))
    
    r <- 1000
    y <- data.frame(ID=rep(x$ID, r) + rep(1:r, each=nrow(x))-1, START=rep(x$START, r), END=rep(x$END, r))
    
    system.time( a1 <- smoothingEpisodes(y) )   # 2.95 seconds
    system.time( a2 <- smoothingEpisodes3(y) )  # 0.10 seconds
    all.equal( a1, a2 )
    

提交回复
热议问题