Python's xrange alternative for R, or how to loop over a large dataset lazily?

心在旅途 2020-11-28 14:41

The following example is based on a discussion about using expand.grid with large data. As you can see, it ends up with an error. I guess this is due to possible combinat

2 answers
  •  夕颜 (original poster)
    2020-11-28 15:38

    One (arguably more "proper") way to approach this would be to write your own iterator for the iterators package that @BenBolker suggested (a PDF on writing extensions is here). Lacking something more formal, here is a poor man's iterator, similar to expand.grid but advanced manually. (Note: this suffices as long as the computation on each iteration is more expensive than this function itself. It could certainly be improved, but it works.)

    lazyExpandGrid returns a function, and each call to that function returns the next combination as a named list (named after the provided factors), or FALSE once all combinations are exhausted. It is lazy in that it never expands the entire list of possible combinations; it is not lazy with the arguments themselves, which are evaluated ('consumed') immediately.

    lazyExpandGrid <- function(...) {
      dots <- list(...)
      sizes <- sapply(dots, length, USE.NAMES = FALSE)
      # Odometer-style position; the first slot starts at 0 so the first call yields row 1
      indices <- c(0, rep(1, length(dots)-1))
      function() {
        indices[1] <<- indices[1] + 1
        # Carry any overflowing position into the next factor
        while (any(rolls <- (indices > sizes))) {
          # Overflow in the last factor means the grid is exhausted
          if (tail(rolls, n=1)) return(FALSE)
          indices[rolls] <<- 1
          indices[ 1+which(rolls) ] <<- indices[ 1+which(rolls) ] + 1
        }
        mapply(`[`, dots, indices, SIMPLIFY = FALSE)
      }
    }
    

    Sample usage:

    nxt <- lazyExpandGrid(a=1:3, b=15:16, c=21:22)
    nxt()
    #   a  b  c
    # 1 1 15 21
    nxt()
    #   a  b  c
    # 1 2 15 21
    nxt()
    #   a  b  c
    # 1 3 15 21
    nxt()
    #   a  b  c
    # 1 1 16 21
    
    ## ... (intermediate calls omitted)
    
    nxt()
    #   a  b  c
    # 1 3 16 22
    nxt()
    # [1] FALSE
    

    NB: for brevity of display, I used as.data.frame(mapply(...)) for the example; it works either way, but if a named list works fine for you then the conversion to a data.frame isn't necessary.
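
    Since the original question is about looping, here is a minimal sketch of driving the iterator until it signals exhaustion by returning FALSE. The function body is a copy of the one above so that the block runs on its own; the names n_seen, total and combo are hypothetical and exist only for illustration, and the arithmetic stands in for whatever expensive work each combination would need.

```r
# Copy of the manually advanced iterator from this answer.
lazyExpandGrid <- function(...) {
  dots <- list(...)
  sizes <- sapply(dots, length, USE.NAMES = FALSE)
  indices <- c(0, rep(1, length(dots) - 1))
  function() {
    indices[1] <<- indices[1] + 1
    while (any(rolls <- (indices > sizes))) {
      if (tail(rolls, n = 1)) return(FALSE)
      indices[rolls] <<- 1
      indices[1 + which(rolls)] <<- indices[1 + which(rolls)] + 1
    }
    mapply(`[`, dots, indices, SIMPLIFY = FALSE)
  }
}

nxt <- lazyExpandGrid(a = 1:3, b = 15:16, c = 21:22)

n_seen <- 0L  # hypothetical counter of combinations visited
total <- 0    # hypothetical accumulator standing in for real per-row work

# A named list is never identical to FALSE, so the loop ends only on exhaustion.
while (!identical(combo <- nxt(), FALSE)) {
  n_seen <- n_seen + 1L
  total <- total + combo$a * combo$b + combo$c
}

n_seen  # 12 combinations (3 * 2 * 2), visited one at a time
```

    Only one combination is held in memory at a time, which is the point of avoiding expand.grid for a large design space.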

    EDIT

    Based on alexis_laz's answer, here's a much-improved version that is (a) much faster and (b) allows arbitrary seeking.

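    lazyExpandGrid <- function(...) {
      dots <- list(...)
      argnames <- names(dots)
      if (is.null(argnames)) argnames <- paste0('Var', seq_along(dots))
      sizes <- lengths(dots)
      # Strides of a mixed-radix counter: 1, n1, n1*n2, ..., total row count
      indices <- cumprod(c(1L, sizes))
      maxcount <- indices[ length(indices) ]
      i <- 0  # internal counter; 0 means nothing has been returned yet
      function(index) {
        # No argument: advance by one; otherwise seek to the requested row(s)
        i <<- if (missing(index)) (i + 1L) else index
        # Vector of indices: evaluate each one and bind the rows into a data.frame
        if (length(i) > 1L) return(do.call(rbind.data.frame, lapply(i, sys.function(0))))
        if (i > maxcount || i < 1L) return(FALSE)
        # Decompose the row number into one position per factor
        setNames(Map(`[[`, dots, (i - 1L) %% indices[-1L] %/% indices[-length(indices)] + 1L  ),
                 argnames)
      }
    }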
    

    The returned function works with no argument (auto-increments the internal counter), a single index (seeks to that row and sets the internal counter), or a vector of indices (seeks to each, sets the counter to the last, and returns a data.frame).

    This last use-case allows for sampling a subset of the design space:

    set.seed(42)
    nxt <- lazyExpandGrid(a=1:1e2, b=1:1e2, c=1:1e2, d=1:1e2, e=1:1e2, f=1:1e2)
    as.data.frame(nxt())
    #   a b c d e f
    # 1 1 1 1 1 1 1
    nxt(sample(1e2^6, size=7))
    #      a  b  c  d  e  f
    # 2   69 61  7  7 49 92
    # 21  72 28 55 40 62 29
    # 3   88 32 53 46 18 65
    # 4   88 33 31 89 66 74
    # 5   57 75 31 93 70 66
    # 6  100 86 79 42 78 46
    # 7   55 41 25 73 47 94
    

    Thanks alexis_laz for the improvements of cumprod, Map, and index calculations!
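
    The index calculation can be checked in isolation. The sketch below pulls the arithmetic out of the improved function and compares it with the row order of expand.grid on a small grid. The factor lengths and the names rowToPositions, grid and positions are assumptions made up for this illustration; they are not part of the answer's code.

```r
sizes <- c(a = 3L, b = 2L, c = 2L)   # hypothetical factor lengths
indices <- cumprod(c(1L, sizes))     # 1, 3, 6, 12: strides, then the total row count

# Hypothetical helper: map a 1-based row number to one position per factor,
# using the same expression as the improved lazyExpandGrid.
rowToPositions <- function(i) {
  (i - 1L) %% indices[-1L] %/% indices[-length(indices)] + 1L
}

rowToPositions(5)  # positions a = 2, b = 2, c = 1

# expand.grid varies the first factor fastest, so row i of the full grid
# should match the positions computed for i.
grid <- expand.grid(a = 1:3, b = 1:2, c = 1:2)
positions <- t(sapply(seq_len(nrow(grid)), rowToPositions))
all(positions == as.matrix(grid))  # TRUE
```

    Because each row is computed directly from its number, seeking to an arbitrary row costs the same as stepping to the next one, which is what makes random sampling of the design space cheap.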
