I'm curious if anyone out there can come up with a (faster) way to calculate rolling statistics (rolling mean, median, percentiles, etc.) over a variable interval of time (
Think of the points as a chain, and of that chain as a graph in which each data point is a node. For each node, we want to find all other nodes that are within a distance w. To do this, I first generate a matrix of pairwise distances along the chain: the nth row holds the distances between nodes that are n positions apart.
# First, some data
x = sort(runif(25000,0,4*pi))
y = sin(x) + rnorm(length(x),0,0.5)
# calculate the rows of the matrix one by one
# until the distance between the two closest nodes is greater than w
# This algorithm is actually faster than `dist` because it usually stops
# much sooner
w = 2  # window half-width used in this example
dl = list()
dl[[1]] = diff(x)
i = 1
while( min(dl[[i]]) <= w ) {
  pdl = dl[[i]]
  dl[[i+1]] = pdl[-length(pdl)] + dl[[1]][-(1:i)]
  i = i+1
}
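# helper function (must be defined before it is used below):
# pads a row with Inf up to `size` entries; side="left" keeps the
# values on the left and pads on the right, anything else does the opposite
inf.pad = function(x,size,side="left") {
  if(side=="left") {
    x = c( x, rep(Inf, size-length(x) ) )
  } else {
    x = c( rep(Inf, size-length(x) ), x )
  }
  x
}
# turn the list of the rows into matrices
rarray = do.call( rbind, lapply(dl,inf.pad,length(x)) )
larray = do.call( rbind, lapply(dl,inf.pad,length(x),"right") )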
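To make the layout of the two matrices concrete, here is a small sketch on toy data. The names `x.toy`, `w.toy`, `rarray.toy`, `larray.toy`, `lookr.toy` and `lookl.toy` are made up for illustration, and `diff(x, lag = k)` is used as a shortcut for row k of the chain; it should give the same values as the loop above, but it is not how the rows are built there.

```r
# Toy data: five sorted x positions and a window half-width (illustrative only)
x.toy = c(0, 1, 1.5, 3, 6)
w.toy = 2

# Row k: distance between points that are k positions apart
d1 = diff(x.toy)            # 1.0 0.5 1.5 3.0
d2 = diff(x.toy, lag = 2)   # 1.5 2.0 4.5
d3 = diff(x.toy, lag = 3)   # 3.0 5.0  (all > w.toy, so the loop would stop here)

# Pad every row with Inf so each has length(x.toy) columns
rarray.toy = rbind(c(d1, Inf), c(d2, Inf, Inf), c(d3, Inf, Inf, Inf))
larray.toy = rbind(c(Inf, d1), c(Inf, Inf, d2), c(Inf, Inf, Inf, d3))

# Column j counts the neighbours of point j that lie within w.toy
lookr.toy = colSums(rarray.toy <= w.toy)  # to the right: 2 2 1 0 0
lookl.toy = colSums(larray.toy <= w.toy)  # to the left:  0 1 2 2 0
```

Column j of the right-padded matrix holds the distances from point j to the points after it, and column j of the left-padded matrix holds the distances to the points before it, which is why a column sum gives the number of neighbours on each side.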
I then use the matrices to determine the edge of each window. For this example, I set w=2.
# How many data points to look left or right at each data point
lookr = colSums( rarray <= w )
lookl = colSums( larray <= w )
# convert these "look" variables to indices of the input vector
ri = 1:length(x) + lookr
li = 1:length(x) - lookl
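As a cross-check on the window edges, the same indices can probably be obtained with base R's `findInterval` on the sorted `x`. This is a different technique from the distance matrices, not the method used in this answer. The names `x.toy`, `w.toy`, `li.alt` and `ri.alt` are made up for illustration, and the sketch assumes an R version whose `findInterval` has the `left.open` argument. The two approaches may disagree when a point sits exactly on a window boundary, because `x[j+k] - x[j] <= w` and `x[j+k] <= x[j] + w` can round differently in floating point.

```r
# Toy data (illustrative only); x must be sorted, as in the answer
x.toy = c(0, 1, 1.5, 3, 6)
w.toy = 2

# Right edge: index of the last point with x <= x[j] + w
ri.alt = findInterval(x.toy + w.toy, x.toy)

# Left edge: count the points strictly below x[j] - w, then step one past them
li.alt = findInterval(x.toy - w.toy, x.toy, left.open = TRUE) + 1

ri.alt  # 3 4 4 4 5
li.alt  # 1 1 1 2 5
```

On this toy input the result matches what `1:length(x) + lookr` and `1:length(x) - lookl` would give, without building the padded matrices.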
With the windows defined, it's pretty simple to use the *apply functions to get the final answer.
rolling.mean = vapply( mapply(':',li,ri), function(i) .Internal(mean(y[i])), 1 )
All of the above code took about 50 seconds on my computer, which is a little faster than the rollmean_r function in my other answer. The especially nice thing here, however, is that the indices are provided, so you can use whatever R function you like with the *apply functions. For example,
rolling.mean = vapply( mapply(':',li,ri),
function(i) .Internal(mean(y[i])), 1 )
takes about 5 seconds. And,
rolling.median = vapply( mapply(':',li,ri),
function(i) median(y[i]), 1 )
takes about 14 seconds. If you wanted to, you could use the Rcpp function in my other answer to get the indices.
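Since the window indices are available, other statistics such as percentiles follow the same pattern. Below is a minimal sketch of a small wrapper; `roll.apply` and the `*.toy` names are hypothetical and not part of the original code. It loops over the positions directly instead of going through `mapply(':', li, ri)`, so it does not depend on how `mapply` simplifies its result.

```r
# Hypothetical helper: apply FUN to y over each window li[j]:ri[j]
roll.apply = function(y, li, ri, FUN, ...) {
  vapply(seq_along(li),
         function(j) FUN(y[li[j]:ri[j]], ...),
         numeric(1))
}

# Toy values and window edges (illustrative only)
y.toy  = c(10, 20, 30, 40, 50)
li.toy = c(1, 1, 1, 2, 5)
ri.toy = c(3, 4, 4, 4, 5)

roll.mean.toy   = roll.apply(y.toy, li.toy, ri.toy, mean)    # 20 25 25 30 50
roll.median.toy = roll.apply(y.toy, li.toy, ri.toy, median)  # 20 25 25 30 50
# Rolling 90th percentile; names = FALSE keeps the result a plain number
roll.q90.toy    = roll.apply(y.toy, li.toy, ri.toy, quantile,
                             probs = 0.9, names = FALSE)
```

With the real data, the call would be `roll.apply(y, li, ri, quantile, probs = 0.9, names = FALSE)`. I have not timed it against the versions above.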