Loop , Monte Carlo and bootstrap script takes too long to run

谁说我不能喝 提交于 2020-06-11 06:09:28

问题


I have the R script below which takes more than 24 hours to but finally runs on Windows 10 with 10 gigabytes of RAM and an Intel Core M7 processor. The script does the following:

  • A. I have generated a 50-time series dataset.

  • B. I slice the same time series data set into chunks of the following sizes: 2,3,...,48,49 making me have 48 different time series formed from step 1 above.

  • C. I divided each 48-time series data set into train and test sets so I can use rmse() function in Metrics package to get the Root Mean Squared Error (RMSE) for the 48 subseries formed in step 2.

  • D. The RMSE for each series is then tabulated according to their chunk sizes

  • E. I obtained the best ARIMA model for each 48 different time series data set.

My R Script

# simulate arima(1,0,0)
library(forecast)
library(Metrics)

n=50
phi <- 0.5
set.seed(1)

wn <- rnorm(n, mean=0, sd=1)
ar1 <- sqrt((wn[1])^2/(1-phi^2))

for(i in 2:n){
  ar1[i] <- ar1[i - 1] * phi + wn[i]
}
ts <- ar1

t <- length(ts)    # the length of the time series
li <- seq(n-2)+1   # vector of block sizes to be 1 < l < n (i.e to be between 1 and n exclusively)

# vector to store block means
RMSEblk <- matrix(nrow = 1, ncol = length(li))
colnames(RMSEblk) <-li

for (b in 1:length(li)){
    l <- li[b]# block size
    m <- ceiling(t / l)                                 # number of blocks
    blk <- split(ts, rep(1:m, each=l, length.out = t))  # divides the series into blocks

    # initialize vector to receive result from for loop
    singleblock <- vector()                     
    for(i in 1:1000){
        res<-sample(blk, replace=T, 10000)        # resamples the blocks
        res.unlist<-unlist(res, use.names = F)    # unlist the bootstrap series
        # Split the series into train and test set
        train <- head(res.unlist, round(length(res.unlist) * 0.6))
        h <- length(res.unlist) - length(train)
        test <- tail(res.unlist, h)

        # Forecast for train set
        model <- auto.arima(train)
        future <- forecast(test, model=model,h=h)
        nfuture <- as.numeric(future$mean)        # makes the `future` object a vector            
        RMSE <- rmse(test, nfuture)               # use the `rmse` function from `Metrics` package

        singleblock[i] <- RMSE # Assign RMSE value to final result vector element i
    }

    RMSEblk[b] <- mean(singleblock) # store into matrix
}

RMSEblk

The script actually runs but it takes more than 24 hours to complete. The number of runs in the loops (10000 and 1000) are the minimum requirement for the task.

Please what can I do to make the script complete in less time?


回答1:


tl;dr you're probably going to have to parallelize this somehow.


One problem is that you are growing an object; that is, first you're allocating a zero-length vector (singleblock <- vector()), then you're incrementing it one element at a time (singleblock[i] <- RMSE). As discussed in Chapter 2 of the R Inferno, this is super-inefficient. For this sample it's 5x slower.

f1 <- function(x) { p <- numeric(0); for (i in 1:1000) p[i] <- 0 }
f2 <- function(x) { p <- numeric(1000); for (i in 1:1000) p[i] <- 0 }
microbenchmark(f1(),f2())
## Unit: microseconds
##  expr     min       lq      mean  median      uq     max neval cld
##  f1() 202.519 207.2105 249.84095 210.574 221.340 3504.95   100   b
##  f2()  40.274  40.6710  69.83741  40.9615  42.8275 2811.779   100  a 

However: that's not really relevant. The inefficient version of this (growing the vector) takes a median time of 210 microseconds.

microbenchmark(auto.arima(train),times=20L)
## Unit: milliseconds
##               expr      min       lq     mean   median       uq      max neval
##  auto.arima(train) 630.7335 648.3471 679.2703 657.6697 668.0563 829.1648    20

Your auto.arima() call takes about 660 milliseconds - about 3000 times longer. Using a similar microbenchmark call for the forecasting step gives a median time of about 20 milliseconds.

You could do more formal profiling, or continue in bits and pieces as shown here, but I don't see anything in your code that looks like it would take a long time (I'd probably check sample() next, but I doubt it's comparable to auto.arima().)

Unless you can find a faster version of auto.arima() (I doubt it), or strip things down (e.g. restrict the search space), your only remaining choice is to parallelize. You can do this at many different levels, with many different tools, but the first place to look would be the parallel option to auto.arima. You might choose instead to parallelize the loop (doing a web search on 'parallel computation in R' gives lots of resources); be aware that trying to parallelize at more than one level can bite you.

PS the crude calculation (48000 * 660 milliseconds) gives about 9 hours - that only accounts for about 1/3 of the time (I would have expected it to get to 80% or so); maybe your processor is slower than mine?




回答2:


For demonstration, to avoid growing objects in a loop, consider apply family solutions such as vapply. Notice how RMSEblk and singleblock are now directly assigned the result of vapply without bookkeeping of assigning elements by index.

...

# DEFINED METHOD
proc_bootstrap <- function(b) {
    l <- li[b]                                          # block size
    m <- ceiling(t / l)                                 # number of blocks
    blk <- split(ts, rep(1:m, each=l, length.out = t))  # divides the series into blocks

    # initialize vector to receive result from for loop
    singleblock <- vapply(1:1000, function(i) {
      res <- sample(blk, replace=TRUE, 10000)        # resamples the blocks
      res.unlist <- unlist(res, use.names = FALSE)   # unlist the bootstrap series

      # Split the series into train and test set
      train <- head(res.unlist, round(length(res.unlist) * 0.6))
      h <- length(res.unlist) - length(train)
      test <- tail(res.unlist, h)

      # Forecast for train set
      model <- auto.arima(train)
      future <- forecast(test, model=model,h=h)
      nfuture <- as.numeric(future$mean)        # makes the `future` object a vector

      RMSE <- Metrics::rmse(test, nfuture)      # RETURN RMSE
    }, numeric(1))

    mean(singleblock)                           # RETURN MEAN
  }

# VAPPLY CALL
RMSEblk <- vapply(1:length(li), proc_bootstrap, numeric(1))

Alternatively, to fill in your originally defined one-row matrix (maybe better as a named vector?):

# MATRIX to store block means
RMSEblk <- matrix(nrow = 1, ncol = length(li))
colnames(RMSEblk) <-li

RMSEblk[] <- vapply(1:length(li), proc_bootstrap, numeric(1))

Note: Above may not materially differ from nested for loops in timings as you still iterate through 48,000 models calls. Possibly, though, this solution can scale better on larger iterations. But as discussed, look into parallel processing (see parallel, doParallel, foreach packages) which can be translated from for or apply solutions.


Be sure to also profile which shows (outside of modeling calls) unlist, head, tail to have timing issues:

utils::Rprof(tmp <- tempfile(), memory.profiling = TRUE)
RMSEblk <- vapply(1:length(li), proc_bootstrap, numeric(1))
utils::Rprof(NULL)
summaryRprof(tmp, memory="both")
unlink(tmp)


来源:https://stackoverflow.com/questions/61979057/loop-monte-carlo-and-bootstrap-script-takes-too-long-to-run

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!