R - Faster Way to Calculate Rolling Statistics Over a Variable Interval

不知归路 2020-12-03 02:00

I'm curious if anyone out there can come up with a (faster) way to calculate rolling statistics (rolling mean, median, percentiles, etc.) over a variable interval of time (

4 answers
  •  刺人心 (original poster)
    2020-12-03 02:41

    Think of all of the points connected as a chain. Think of this chain as a graph, where each data point is a node. Then, for each node, we want to find all other nodes that are distance w or less away. To do this, I first generate a matrix of pairwise distances: the nth row gives the distances between nodes that are n positions apart in the chain.

    # First, some data
    x = sort(runif(25000,0,4*pi))
    y = sin(x) + rnorm(length(x),0,0.5)
    w = 2   # window half-width used in this example
    
    # helper: pad a row with Inf so that every row has length `size`
    # (defined first, because it is used below)
    # side="left" keeps the values on the left and pads on the right
    inf.pad = function(x,size,side="left") {
      if(side=="left") {
        x = c( x, rep(Inf, size-length(x) ) )
      } else {
        x = c( rep(Inf, size-length(x) ), x )
      }
      x
    }
    
    # calculate the rows of the matrix one by one
    # until the distance between the two closest nodes is greater than w
    # This algorithm is actually faster than `dist` because it usually stops
    # much sooner
    dl = list()
    dl[[1]] = diff(x)
    i = 1
    while( min(dl[[i]]) <= w ) {
      pdl = dl[[i]]
      dl[[i+1]] = pdl[-length(pdl)] + dl[[1]][-(1:i)]
      i = i+1
    }
    
    # turn the list of the rows into matrices
    # rarray: column j holds distances from x[j] to the points on its right
    # larray: column j holds distances from x[j] to the points on its left
    rarray = do.call( rbind, lapply(dl,inf.pad,length(x)) )
    larray = do.call( rbind, lapply(dl,inf.pad,length(x),"right") )
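
    To see what the rows look like, here is a minimal sketch of the same idea on a toy input. The names `x_small`, `w_small`, `gaps`, `rows` and `lookr_small` are made up for this illustration, and the extra `length(...) > 1` guard is my own addition so the loop also stops cleanly when no more pairs remain.

    ```r
    # Toy input: five sorted positions (hypothetical values)
    x_small = c(0, 1, 3, 4, 8)
    w_small = 2

    gaps = diff(x_small)        # row 1: distance from each point to the next one
    rows = list(gaps)
    i = 1
    # Row i+1 is built from row i by adding one more gap to each entry.
    # Stop once every i-step distance exceeds w (or no i-step pairs remain).
    while (length(rows[[i]]) > 1 && min(rows[[i]]) <= w_small) {
      prev = rows[[i]]
      rows[[i + 1]] = prev[-length(prev)] + gaps[-(1:i)]
      i = i + 1
    }

    # Pad each row on the right with Inf and stack the rows into a matrix,
    # so that column j holds the distances from point j to the points on its right
    n = length(x_small)
    right = do.call(rbind, lapply(rows, function(r) c(r, rep(Inf, n - length(r)))))

    # Number of points within w to the right of each point
    lookr_small = colSums(right <= w_small)
    ```

    Here the second row holds the distances between points two positions apart (x[3]-x[1], x[4]-x[2], x[5]-x[3]), and the loop stops there because all of them already exceed the window.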
    

    I then use the matrices to determine the edge of each window. For this example, I set w=2.

    # How many data points to look left or right at each data point
    lookr = colSums( rarray <= w )
    lookl = colSums( larray <= w )
    
    # convert these "look" variables to indices of the input vector
    ri = 1:length(x) + lookr
    li = 1:length(x) - lookl
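
    As a cross-check on the window edges, the same kind of indices can be sketched with base R's `findInterval`. This is a different technique from the matrix approach above, not part of it; the names `x_chk`, `w_chk`, `li_chk` and `ri_chk` are hypothetical, and it assumes the input is sorted. Points lying exactly on a window boundary could in principle be classified differently by the two approaches because of floating-point rounding.

    ```r
    set.seed(1)                        # arbitrary seed, only for reproducibility
    x_chk = sort(runif(200, 0, 10))    # sorted positions
    w_chk = 2

    # Right edge: last index j with x[j] <= x[i] + w
    ri_chk = findInterval(x_chk + w_chk, x_chk)

    # Left edge: first index j with x[j] >= x[i] - w.
    # With left.open = TRUE, findInterval counts the elements strictly below
    # x[i] - w, so adding 1 gives the first element inside the window.
    li_chk = findInterval(x_chk - w_chk, x_chk, left.open = TRUE) + 1L

    # Brute-force reference for comparison (quadratic, fine for small inputs)
    ri_ref = vapply(x_chk, function(xi) max(which(x_chk <= xi + w_chk)), 1L)
    li_ref = vapply(x_chk, function(xi) min(which(x_chk >= xi - w_chk)), 1L)
    ```

    On this toy data `li_chk` and `ri_chk` agree with the brute-force reference, which makes it a convenient sanity check for the `li` and `ri` vectors.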
    

    With the windows defined, it's pretty simple to use the *apply functions to get the final answer.

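    # SIMPLIFY=FALSE guarantees a list of index vectors, even if all windows
    # happen to have the same length; .Internal(mean()) skips method dispatch
    rolling.mean = vapply( mapply(':',li,ri,SIMPLIFY=FALSE), function(i) .Internal(mean(y[i])), 1 )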
    

    All of the above code took about 50 seconds on my computer. This is a little faster than the rollmean_r function in my other answer. However, the especially nice thing here is that the indices are provided, so you can use whatever R function you like with the *apply functions. For example,

    rolling.mean = vapply( mapply(':',li,ri,SIMPLIFY=FALSE),
                           function(i) .Internal(mean(y[i])), 1 )
    

    takes about 5 seconds. And,

    rolling.median = vapply( mapply(':',li,ri,SIMPLIFY=FALSE),
                             function(i) median(y[i]), 1 )
    

    takes about 14 seconds. If you wanted to, you could use the Rcpp function in my other answer to get the indices.
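
    Since the question also mentions percentiles, here is a small sketch of the same pattern with `quantile`. The toy vectors `x_q`, `y_q`, `li_q` and `ri_q` are invented for this illustration; the window edges were worked out by hand for w = 2 rather than produced by the code above.

    ```r
    # Toy data (hypothetical values): positions and observations
    x_q = c(0, 1, 3, 4, 8)
    y_q = c(10, 20, 30, 40, 50)

    # Window edges for w = 2: for each point, the first and last index
    # whose position lies within 2 of it
    li_q = c(1, 1, 2, 3, 5)
    ri_q = c(2, 3, 4, 4, 5)

    # One index vector per data point; Map always returns a list
    windows = Map(seq, li_q, ri_q)

    # Rolling 90th percentile (default quantile type 7), one value per point
    rolling.q90 = vapply(windows,
                         function(i) quantile(y_q[i], 0.9, names = FALSE),
                         numeric(1))
    ```

    Any function that maps a numeric vector to a single number can be swapped in for `quantile` here, at whatever per-window cost that function carries.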
