Calculate vector whose length is not known beforehand - should I “grow” it?

女生的网名这么多〃 提交于 2019-12-04 01:16:36

I could "grow it in chunks": pre-allocate a "good-sized" vector, fill it, double its length when it is full, and finally cut it down to size. But this feels error-prone and will make for inelegant code.

Sounds like you are referring to the accepted answer of Collecting an unknown number of results in a loop. Have you coded it up and tried it? The idea of length doubling is more than sufficient (see the end of this answer), as the length will grow geometrically. I will demonstrate my method in the following.


For test purpose, wrap your code in a function. Note how I avoid doing sum(z) for every while test.

ref <- function (stop_sum, timing = TRUE) {
  set.seed(0)                            ## fix a seed to compare performance
  if (timing) t1 <- proc.time()[[3]]
  z <- numeric(0)
  sum_z <- 0
  while ( sum_z < stop_sum ) {
    z_i <- runif(1)
    z <- c(z, z_i)
    sum_z <- sum_z + z_i
    }
  if (timing) {
    t2 <- proc.time()[[3]]
    return(t2 - t1)                      ## return execution time
    } else {
    return(z)                            ## return result
    }
  }

Chunking is necessary to reduce the operational costs of concatenation.

template <- function (chunk_size, stop_sum, timing = TRUE) {
  set.seed(0)                            ## fix a seed to compare performance
  if (timing) t1 <- proc.time()[[3]]
  z <- vector("list")                    ## store all segments in a list
  sum_z <- 0                             ## cumulative sum
  while ( sum_z < stop_sum ) {
    segmt <- numeric(chunk_size)         ## initialize a segment
    i <- 1
    while (i <= chunk_size) {
      z_i <- runif(1)                    ## call a function & get a value
      sum_z <- sum_z + z_i               ## update cumulative sum
      segmt[i] <- z_i                    ## fill in the segment
      if (sum_z >= stop_sum) break       ## ready to break at any time
      i <- i + 1
      }
    ## grow the list
    if (sum_z < stop_sum) z <- c(z, list(segmt))
    else z <- c(z, list(segmt[1:i]))
    }
  if (timing) {
    t2 <- proc.time()[[3]]
    return(t2 - t1)                      ## return execution time
    } else {
    return(unlist(z))                    ## return result
    }
  }

Let's check correctness first.

z <- ref(1e+4, FALSE)
z1 <- template(5, 1e+4, FALSE)
z2 <- template(1000, 1e+4, FALSE)

range(z - z1)
#[1] 0 0

range(z - z2)
#[1] 0 0

Let's then compare speed.

## reference implementation
t0 <- ref(1e+4, TRUE)

## unrolling implementation
trial_chunk_size <- seq(5, 1000, by = 5)
tm <- sapply(trial_chunk_size, template, stop_sum = 1e+4, timing = TRUE)

## visualize timing statistics
plot(trial_chunk_size, tm, type = "l", ylim = c(0, t0), col = 2, bty = "l")
abline(h = t0, lwd = 2)

Looks like chunk_size = 200 is sufficiently good, and the speedup factor is

t0 / tm[trial_chunk_size == 200]
#[1] 16.90598

Let's finally see how much time is spent for growing vector with c, via profiling.

Rprof("a.out")
z0 <- ref(1e+4, FALSE)
Rprof(NULL)
summaryRprof("a.out")$by.self
#        self.time self.pct total.time total.pct
#"c"          1.68    90.32       1.68     90.32
#"runif"      0.12     6.45       0.12      6.45
#"ref"        0.06     3.23       1.86    100.00

Rprof("b.out")
z1 <- template(200, 1e+4, FALSE)
Rprof(NULL)
summaryRprof("b.out")$by.self
#        self.time self.pct total.time total.pct
#"runif"      0.10    83.33       0.10     83.33
#"c"          0.02    16.67       0.02     16.67

Adaptive chunk_size with linear growth

ref has O(N * N) operational complexity where N is the length of the final vector. template in principle has O(M * M) complexity, where M = N / chunk_size. To attain linear complexity O(N), chunk_size needs to grow with N, but a linear growth is sufficient: chunk_size <- chunk_size + 1.

template1 <- function (chunk_size, stop_sum, timing = TRUE) {
  set.seed(0)                            ## fix a seed to compare performance
  if (timing) t1 <- proc.time()[[3]]
  z <- vector("list")                    ## store all segments in a list
  sum_z <- 0                             ## cumulative sum
  while ( sum_z < stop_sum ) {
    segmt <- numeric(chunk_size)         ## initialize a segment
    i <- 1
    while (i <= chunk_size) {
      z_i <- runif(1)                    ## call a function & get a value
      sum_z <- sum_z + z_i               ## update cumulative sum
      segmt[i] <- z_i                    ## fill in the segment
      if (sum_z >= stop_sum) break       ## ready to break at any time
      i <- i + 1
      }
    ## grow the list
    if (sum_z < stop_sum) z <- c(z, list(segmt))
    else z <- c(z, list(segmt[1:i]))
    ## increase chunk_size
    chunk_size <- chunk_size + 1
    }
  ## remove this line if you want
  cat(sprintf("final chunk size = %d\n", chunk_size))
  if (timing) {
    t2 <- proc.time()[[3]]
    return(t2 - t1)                      ## return execution time
    } else {
    return(unlist(z))                    ## return result
    }
  }

A quick test verifies that we have attained linear complexity.

template1(200, 1e+4)
#final chunk size = 283
#[1] 0.103

template1(200, 1e+5)
#final chunk size = 664
#[1] 1.076

template1(200, 1e+6)
#final chunk size = 2012
#[1] 10.848

template1(200, 1e+7)
#final chunk size = 6330
#[1] 108.183
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!