Need faster rolling apply function with start to stop indices


Question


Below is the piece of code. It gives the percentile of the trade price level over a rolling (historical) 15-minute window. It runs quickly if the length is 500 or 1000, but as you can see there are 45K observations, and for the entire data set it is very slow. Can I apply any of the plyr functions? Any other suggestions are welcome.

This is what the trade data looks like:

> str(trade)
'data.frame':   45571 obs. of  5 variables:
 $ time    : chr  "2013-10-20 22:00:00.489" "2013-10-20 22:00:00.807" "2013-10-20 22:00:00.811" "2013-10-20 22:00:00.811" ...
 $ prc     : num  121 121 121 121 121 ...
 $ siz     : int  1 4 1 2 3 3 2 2 3 4 ...
 $ aggress : chr  "B" "B" "B" "B" ...
 $ time.pos: POSIXlt, format: "2013-10-20 22:00:00.489" "2013-10-20 22:00:00.807" "2013-10-20 22:00:00.811" "2013-10-20 22:00:00.811" ...

And this is what the data looks like after adding the new column trade$time.pos:

trade$time.pos <- strptime(trade$time, format="%Y-%m-%d %H:%M:%OS") 

> head(trade)
                     time      prc siz aggress                time.pos
1 2013-10-20 22:00:00.489 121.3672   1       B 2013-10-20 22:00:00.489
2 2013-10-20 22:00:00.807 121.3750   4       B 2013-10-20 22:00:00.807
3 2013-10-20 22:00:00.811 121.3750   1       B 2013-10-20 22:00:00.811
4 2013-10-20 22:00:00.811 121.3750   2       B 2013-10-20 22:00:00.811
5 2013-10-20 22:00:00.811 121.3750   3       B 2013-10-20 22:00:00.811
6 2013-10-20 22:00:00.811 121.3750   3       B 2013-10-20 22:00:00.811

#t_15_index returns the indices of the trades executed in the last 15 minutes up to the current trade (t-15 to t)
t_15_index <- function(data_vector,index) {
  which(data_vector[index] - data_vector[1:index]<=15*60)
}

get_percentile <- function(trade) {
  len_d <- dim(trade)[1]  

  price_percentile = vector(length=len_d)  

  for(i in 1: len_d) {   

    t_15 = t_15_index(trade$time.pos,i)
    #ecdf(rep(..)) gets the empirical distribution of the trade price, weighted by trade size
    price_dist = ecdf(rep(trade$prc[t_15],trade$siz[t_15]))
    #percentile of the current price level within the current (t-15 to t) subset of the data
    price_percentile[i] = price_dist(trade$prc[i])
  }
  trade$price_percentile = price_percentile
  trade
}


res_trade = get_percentile(trade)

Answer 1:


There may be a way to accelerate the rolling application, but because of the changing window size I don't think the standard tools (e.g. rollapply) work here, though perhaps someone more familiar with them will have ideas. In the meantime, you can optimize your percentile calculation. Instead of using ecdf, which creates a function with all the associated overhead, you can calculate a decent approximation directly:

> vec <- rnorm(10000, 0, 3)
> val <- 5
> max(which(sort(vec) < val)) / length(vec)
[1] 0.9543
> ecdf(vec)(val)
[1] 0.9543
> library(microbenchmark)
> microbenchmark(max(which(sort(vec) < val)) / length(vec))
Unit: milliseconds
expr      min       lq   median       uq      max neval
max(which(sort(vec) < val))/length(vec) 1.093434 1.105231 1.116364 1.141204 1.449141   100
> microbenchmark(ecdf(vec)(val))
Unit: milliseconds
expr      min       lq   median       uq      max neval
ecdf(vec)(val) 2.552946 2.808041 3.043579 3.439269 4.208202   100

Roughly 2.5x improvement. The improvement is greater for smaller samples.
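As a side note (my addition, not part of the original answer): the exact ecdf value can also be computed without sorting at all, which avoids the edge case where no element is strictly less than val (max(which(...)) then warns and returns -Inf):

> mean(vec <= val)                 # identical to ecdf(vec)(val)
[1] 0.9543
> sum(vec <= val) / length(vec)    # same thing, spelled out
[1] 0.9543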




Answer 2:


Alright, this question got me interested. These are the things I did:

  1. replaced ecdf with the custom percentile computation
  2. changed time.pos to numeric (since it is in seconds anyway), as there is additional overhead from [.POSIXct vs [
  3. changed t_15_index to only look back as far as the previous window's earliest time stamp; since the data is sorted, we don't need to look back all the way to index 1

And this is the outcome:

> system.time(res2 <- get_percentile2(trade))
  user  system elapsed 
14.458   0.768  15.215 
> system.time(res1 <- get_percentile(trade))
   user  system elapsed 
110.851  17.974 128.736 

And demonstrating the outputs are the same:

tail(sort(abs(res1$price_percentile - res2$price_percentile)), 50)
#  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [45] 0 0 0 0 0 0

Approximately an 8.5x improvement. Note the improvement is much greater if you have fewer items per 15-minute interval: this example crams 45K points into 24 hours, so if your 45K observations actually span several days you can expect more improvement. Here is the code:

# Same as t_15_index, but only searches from min.index (the start of the
# previous window) onward; since the data is sorted by time we never need to
# look further back than that. `minutes` (15) is defined in the data setup below.
t_15_index2 <- function(data_vector, index, min.index) {
  which(data_vector[index] - data_vector[min.index:index] <= minutes*60) + min.index - 1L
}
get_percentile2 <- function(trade) {
  len_d <- dim(trade)[1]
  price_percentile = vector(length=len_d)
  min.index <- 1
  for(i in 1:len_d) {
    t_15 = t_15_index2(trade$time.pos.2, i, min.index)
    vec <- rep(trade$prc[t_15], trade$siz[t_15])
    # direct percentile computation in place of ecdf()
    price_percentile[i] <- max(0, which(sort(vec) <= trade$prc[i])) / length(vec)
    min.index <- t_15[[1]]
  }
  trade$price_percentile = price_percentile
  trade
}

And here is the data:

start <- as.numeric(as.POSIXct("2013-01-01"))
end <- as.numeric(as.POSIXct("2013-01-02"))
set.seed(1)
minutes <- 15
ticks <- 45000
times <- sort(as.POSIXct(runif(ticks, start, end), origin=as.POSIXct("1970-01-01")))
trade <- data.frame(
  time=as.character(times),
  prc=100 + rnorm(ticks, 0, 5),
  siz=sample(1:10, ticks, rep=T),
  time.pos=times,
  time.pos.2=as.numeric(times)
)

And finally, the original functions slightly modified, but basically the same:

t_15_index <- function(data_vector,index) {
  which(data_vector[index] - data_vector[1:index]<=minutes*60)
}
get_percentile <- function(trade) {
  len_d <- dim(trade)[1]    
  price_percentile = vector(length=len_d)  
  for(i in 1: len_d) {       
    t_15 = t_15_index(trade$time.pos,i)
    price_dist = ecdf(rep(trade$prc[t_15],trade$siz[t_15]))
    price_percentile[i] = price_dist(trade$prc[i])
  }
  trade$price_percentile = price_percentile
  trade
}

Also, ddply will definitely not help with this. I also don't think data.table will help much, because the main lookup here on the large vector is done with integer indices. It could help a little; I may check that out later.


EDIT: one more way you could probably get a fairly significant performance improvement is to update your t_15 vector one element at a time instead of re-selecting it wholesale on every iteration. In other words, find the next time value and drop the values that are no longer within 15 minutes from your t_15 vector. Doing this is fairly complex in your case because the 15-minute window varies in how many items it can contain. You would size your baseline t_15 vector to be large enough to hold most windows, and any time you encounter a larger window, expand the vector (on the assumption that this won't happen often). I'm not going to do this here because dealing with the changing window size requires a fair bit of book-keeping, but to give you an idea, here is an example where we loop through a 1MM vector with a 2K window:

# `vec` is assumed here to be a 1MM-element numeric vector, e.g.:
vec <- runif(1e6)

# Version that pulls the whole 2000 entries each time
sub.vec <- numeric(2000)
system.time(r1 <- for(i in seq_len(length(vec) - 2000)) sub.vec <- vec[i:(i+1999)])
#  user  system elapsed 
# 17.507   4.723  22.211 

# Version that overwrites 1 value at a time keeping the most recent 2K
sub.vec <- numeric(2001) # need to make this slightly larger because of 2000 %% 2000 == 0
system.time(r2 <- for(i in seq_len(length(vec) - 2000)) sub.vec[[(i %% 2000) + 1]] <- vec[[i]])

#  user  system elapsed 
# 2.642   0.009   2.650 

all.equal(r1, tail(r2, -1L))
# [1] TRUE

Finally, if you do it this way, you may also be able to come up with clever ways to re-calculate the percentiles very quickly, given that you know what your 15-minute window contained, what was added, and what was removed.

I'm not 100% sure, though, whether the bookkeeping required for the FIFO 15-minute window would end up outweighing the benefits.
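To make that idea a bit more concrete, here is a sketch of my own (not from the answer above), assuming prices fall on a modest number of discrete levels: keep a running tally of total traded size per price level inside the window, update it as trades enter and leave, and read the percentile off the tally instead of re-sorting every iteration.

# Hypothetical helpers: `sizes` is a named numeric vector mapping a price level
# (stored as a character key) to the total traded size currently in the window.
update_window <- function(sizes, add_prc, add_siz, drop_prc, drop_siz) {
  for (j in seq_along(add_prc)) {          # trades entering the window
    k <- format(add_prc[j], digits = 15)
    sizes[k] <- if (k %in% names(sizes)) sizes[k] + add_siz[j] else add_siz[j]
  }
  for (j in seq_along(drop_prc)) {         # trades dropping out of the window
    k <- format(drop_prc[j], digits = 15)
    sizes[k] <- sizes[k] - drop_siz[j]
  }
  sizes[sizes > 0]
}
percentile_from_tally <- function(sizes, prc) {
  lvls <- as.numeric(names(sizes))
  sum(sizes[lvls <= prc]) / sum(sizes)
}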




Answer 3:


Here's a quick stab at more efficiently finding, for each trade, the index where the fifteen-minute window starts:

# Create some sample data (modified from BrodieG)
set.seed(1)

ticks <- 45000
start <- as.numeric(as.POSIXct("2013-01-01"))
end <- as.numeric(as.POSIXct("2013-01-02"))
times <- as.POSIXct(runif(ticks, start, end), origin=as.POSIXct("1970-01-01"))
trade <- data.frame(
  time = sort(times),
  prc = 100 + rnorm(ticks, 0, 5),
  siz = sample(1:10, ticks, rep = T)
)

# For a vector of times, find the index of the earliest time that falls within
# the last `minutes` minutes of the current time (i.e. the start of the rolling
# window). Assumes times are sorted.
minutes_ago <- function(time, minutes = 15) {
  secs <- minutes * 60
  time <- as.numeric(time)
  out <- integer(length(time))

  before <- 1

  for(i in seq_along(out)) {
    while(time[before] < time[i] - secs) {
      before <- before + 1
    }
    out[i] <- before

  }
  out
}
system.time(minutes_ago(trade$time))
# Takes about 0.2s on my machine

library(Rcpp)
cppFunction("IntegerVector minutes_ago2(NumericVector time, int minutes = 15) {
  int secs = minutes * 60;
  int n = time.size();
  IntegerVector out(n);

  int before = 0;
  for (int i = 0; i < n; ++i) {
    // Could do even better here with a binary search
    while(time[before] < time[i] - secs) {
      before++;
    }
    out[i] = before + 1;
  }
  return out;
}")

system.time(minutes_ago2(trade$time, 10))
# Takes less than 0.001s

all.equal(minutes_ago(trade$time, 15), minutes_ago2(trade$time, 15))
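To tie this back to the original question (a sketch of my own, reusing the direct percentile calculation from the answers above): once you have the window-start indices, the rest of the loop is straightforward:

start_idx <- minutes_ago2(trade$time, 15)
trade$price_percentile <- vapply(seq_len(nrow(trade)), function(i) {
  w <- start_idx[i]:i                      # indices inside the 15-minute window
  v <- rep(trade$prc[w], trade$siz[w])     # size-weighted prices
  sum(v <= trade$prc[i]) / length(v)       # percentile of the current price
}, numeric(1))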



Answer 4:


If you want ecdf-style percentiles within dplyr, use seq_along/sapply inside mutate to get faster results (note that this computes the percentile over all rows up to the current one, i.e. an expanding window rather than a 15-minute one):

y <- x %>% mutate(percentile.score = sapply(seq_along(score), function(i){sum(score[1:i] <= score[i])/i}))
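For reference, a minimal hypothetical input for the snippet above (x and score are placeholders for your own data frame and column):

library(dplyr)
x <- data.frame(score = c(3, 1, 4, 1, 5, 9, 2, 6))
y <- x %>%
  mutate(percentile.score = sapply(seq_along(score),
                                   function(i) sum(score[1:i] <= score[i]) / i))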


Source: https://stackoverflow.com/questions/21062927/need-faster-rolling-apply-function-with-start-to-stop-indices
