Need faster rolling apply function with start to stop indices


There may be a way to accelerate the rolling application, but due to the changing window size I don't think the standard tools (e.g. rollapply) work here, though perhaps someone more familiar with them will have ideas. In the meantime, you can optimize your percentile calculation. Instead of using ecdf, which creates a function with all the associated overhead, you can calculate a decent approximation directly:

> vec <- rnorm(10000, 0, 3)
> val <- 5
> max(which(sort(vec) < val)) / length(vec)
[1] 0.9543
> ecdf(vec)(val)
[1] 0.9543
> library(microbenchmark)
> microbenchmark(max(which(sort(vec) < val)) / length(vec))
Unit: milliseconds
expr      min       lq   median       uq      max neval
max(which(sort(vec) < val))/length(vec) 1.093434 1.105231 1.116364 1.141204 1.449141   100
> microbenchmark(ecdf(vec)(val))
Unit: milliseconds
expr      min       lq   median       uq      max neval
ecdf(vec)(val) 2.552946 2.808041 3.043579 3.439269 4.208202   100

Roughly 2.5x improvement. The improvement is greater for smaller samples.
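
To check the small-sample claim yourself, you can re-run the benchmark with a shorter vector, re-using val from above (just a sketch; exact timings will vary by machine):

vec.small <- rnorm(500, 0, 3)
microbenchmark(
  direct = max(which(sort(vec.small) < val)) / length(vec.small),
  ecdf   = ecdf(vec.small)(val)
)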

Alright, this question got me interested. Here is what I did:

  1. replaced ecdf with the custom percentile computation
  2. changed time.pos to numeric (it's in seconds anyway), since there is additional overhead in [.POSIXct vs [ (see the quick benchmark after this list)
  3. changed t_15_index to only look back as far as the previous window's earliest time stamp; since the data is sorted, we don't need to look back all the way to index 1
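
To give a rough sense of point 2 (just a sketch; the exact numbers will vary by machine), subsetting a plain numeric vector skips the [.POSIXct method dispatch:

library(microbenchmark)
t.posix <- as.POSIXct("2013-01-01") + sort(runif(45000, 0, 86400))
t.num <- as.numeric(t.posix)
microbenchmark(posix = t.posix[1:100], numeric = t.num[1:100])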

And this is the outcome of the three changes:

> system.time(res2 <- get_percentile2(trade))
  user  system elapsed 
14.458   0.768  15.215 
> system.time(res1 <- get_percentile(trade))
   user  system elapsed 
110.851  17.974 128.736 

And demonstrating the outputs are the same:

tail(sort(abs(res1$price_percentile - res2$price_percentile)), 50)
#  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [45] 0 0 0 0 0 0

Approx 8.5x improvement. Note the improvement is much greater if you have fewer items per 15 minute interval; this example crams 45K points into 24 hours, so if your 45K points actually span a few days you can expect more improvement. Here is the code:

# Only looks back as far as the previous window's earliest index (min.index),
# relying on the data being sorted by time; `minutes` comes from the global env
t_15_index2 <- function(data_vector, index, min.index) {
  which(data_vector[index] - data_vector[min.index:index] <= minutes*60) + min.index - 1L
}
get_percentile2 <- function(trade) {
  len_d <- dim(trade)[1]
  price_percentile = vector(length=len_d)
  min.index <- 1
  for(i in 1:len_d) {
    t_15 = t_15_index2(trade$time.pos.2, i, min.index)
    # expand prices by size so the percentile is volume weighted, as in the original
    vec <- rep(trade$prc[t_15], trade$siz[t_15])
    price_percentile[i] <- max(0, which(sort(vec) <= trade$prc[i])) / length(vec)
    min.index <- t_15[[1]]
  }
  trade$price_percentile = price_percentile
  trade
}

And here is the data:

start <- as.numeric(as.POSIXct("2013-01-01"))
end <- as.numeric(as.POSIXct("2013-01-02"))
set.seed(1)
minutes <- 15
ticks <- 45000
times <- sort(as.POSIXct(runif(ticks, start, end), origin=as.POSIXct("1970-01-01")))
trade <- data.frame(
  time=as.character(times),
  prc=100 + rnorm(ticks, 0, 5),
  siz=sample(1:10, ticks, rep=T),
  time.pos=times,
  time.pos.2=as.numeric(times)
)

And finally, the original functions slightly modified, but basically the same:

t_15_index <- function(data_vector,index) {
  which(data_vector[index] - data_vector[1:index]<=minutes*60)
}
get_percentile <- function(trade) {
  len_d <- dim(trade)[1]    
  price_percentile = vector(length=len_d)  
  for(i in 1: len_d) {       
    t_15 = t_15_index(trade$time.pos,i)
    price_dist = ecdf(rep(trade$prc[t_15],trade$siz[t_15]))
    price_percentile[i] = price_dist(trade$prc[i])
  }
  trade$price_percentile = price_percentile
  trade
}

Also, ddply will definitely not help with this. I also don't think data.table will help much, because the main lookup here, on the large vector, is done with integer indices. It could help a little; I may check that out later.


EDIT: one more way you could probably get a fairly significant performance improvement is to update your t_15 vector one value at a time instead of re-selecting it wholesale every iteration. In other words, find the next time value, and drop the values that are no longer within 15 minutes from your t_15 vector. Doing this is fairly complex in your case since your 15 minute window varies in how many items it can contain. You would size your baseline t_15 vector to be large enough to hold most windows, and any time you encounter a larger window, expand the vector (with the assumption this won't happen often). I'm not going to do this because dealing with the changing window size requires a fair bit of book-keeping, but to give you an idea, here is an example where we loop through a 1MM vector with a 2K window:

# Assumes a 1MM-point vector for this illustration, e.g.:
vec <- rnorm(1e6)

# Version that pulls the whole 2000 entries each time
sub.vec <- numeric(2000)
system.time(r1 <- for(i in seq_len(length(vec) - 2000)) sub.vec <- vec[i:(i+1999)])
#  user  system elapsed 
# 17.507   4.723  22.211 

# Version that overwrites 1 value at a time keeping the most recent 2K
sub.vec <- numeric(2001) # need to make this slightly larger because of 2000 %% 2000 == 0
system.time(r2 <- for(i in seq_len(length(vec) - 2000)) sub.vec[[(i %% 2000) + 1]] <- vec[[i]])

#  user  system elapsed 
# 2.642   0.009   2.650 

all.equal(r1, tail(r2, -1L))
# [1] TRUE

Finally, if you do it this way, you may also be able to come up with clever ways to re-calculate percentiles very quickly, given that you know what your 15 minute window contained, what was added, and what was removed.
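
For example, you could keep the window as a sorted vector and update it incrementally. Here is a minimal sketch (window_add, window_drop and window_percentile are just illustrative names, and this ignores the siz weighting):

window_add <- function(window, x) {
  # insert x, keeping the window sorted
  append(window, x, after = findInterval(x, window))
}
window_drop <- function(window, x) {
  # remove one occurrence of x (assumed to be present)
  window[-match(x, window)]
}
window_percentile <- function(window, x) {
  # proportion of window values <= x, same approximation as above
  findInterval(x, window) / length(window)
}

w <- sort(c(99.5, 100.1, 101.2))
w <- window_add(w, 100.7)
window_percentile(w, 100.7)  # 0.75
w <- window_drop(w, 99.5)    # price falls out of the 15 minute window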

Not 100% sure whether the bookkeeping required for the FIFO 15 minute window would end up outweighing the benefits of doing it, though.

Here's a quick stab at more efficiently finding the index of the time that occurred fifteen minutes ago:

# Create some sample data (modified from BrodieG)
set.seed(1)

ticks <- 45000
start <- as.numeric(as.POSIXct("2013-01-01"))
end <- as.numeric(as.POSIXct("2013-01-02"))
times <- as.POSIXct(runif(ticks, start, end), origin=as.POSIXct("1970-01-01"))
trade <- data.frame(
  time = sort(times),
  prc = 100 + rnorm(ticks, 0, 5),
  siz = sample(1:10, ticks, rep = T)
)

# For vector of times, find the index of the first time that was at least
# fifteen minutes before the current time. Assumes times are sorted
minutes_ago <- function(time, minutes = 15) {
  secs <- minutes * 60
  time <- as.numeric(time)
  out <- integer(length(time))

  before <- 1

  for(i in seq_along(out)) {
    # advance the pointer until it is within `minutes` of time[i];
    # it never needs to move backwards because the times are sorted
    while(time[before] < time[i] - secs) {
      before <- before + 1
    }
    out[i] <- before
  }
  out
}
system.time(minutes_ago(trade$time))
# Takes about 0.2s on my machine

library(Rcpp)
cppFunction("IntegerVector minutes_ago2(NumericVector time, int minutes = 15) {
  int secs = minutes * 60;
  int n = time.size();
  IntegerVector out(n);

  int before = 0;
  for (int i = 0; i < n; ++i) {
    // Could do even better here with a binary search
    while(time[before] < time[i] - secs) {
      before++;
    }
    out[i] = before + 1;
  }
  return out;
}")

system.time(minutes_ago2(trade$time, 10))
# Takes less than 0.001s

all.equal(minutes_ago(trade$time, 15), minutes_ago2(trade$time, 15))
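
If you want to plug this back into the percentile computation, something along these lines should work (a sketch, reusing the weighted-percentile idea from above with the trade data defined in this answer):

starts <- minutes_ago2(as.numeric(trade$time), 15)
price_percentile <- vapply(seq_len(nrow(trade)), function(i) {
  idx <- starts[i]:i
  vec <- rep(trade$prc[idx], trade$siz[idx])
  sum(vec <= trade$prc[i]) / length(vec)
}, numeric(1))
trade$price_percentile <- price_percentile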

If you want an ecdf-style percentile within dplyr, use seq_along/sapply inside mutate to get faster results:

y <- x %>% mutate(percentile.score = sapply(seq_along(score), function(i){sum(score[1:i] <= score[i])/i}))
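
As a quick sanity check (with a made-up x data frame, since x isn't defined above), the sapply expression gives the same values as calling ecdf at each step:

library(dplyr)
set.seed(1)
x <- data.frame(score = rnorm(100))
y1 <- x %>% mutate(percentile.score = sapply(seq_along(score), function(i){sum(score[1:i] <= score[i])/i}))
y2 <- x %>% mutate(percentile.score = sapply(seq_along(score), function(i) ecdf(score[1:i])(score[i])))
all.equal(y1$percentile.score, y2$percentile.score)
# [1] TRUE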