Group vector on conditional sum

后端 未结 2 1273
庸人自扰
庸人自扰 2020-12-17 21:03

I want to group a vector based on the sum of the elements being less than or equal to n. Assume the following,

set.seed(1)
x <- sample(10, 20         


        
相关标签:
2条回答
  • 2020-12-17 21:59

    Here is my Rcpp-solution (close to Khashaa's solution but a bit shorter/stripped down), because you said speed was important, Rcppis probably the way to go:

    # create the data
    set.seed(1)
    x <- sample(10, 20, replace = TRUE)
    y <- c(1, 1, 1, 2, 2, 3, 4, 5 ,5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10)
    
    # create the Rcpp function
    library(Rcpp)
    cppFunction('
    IntegerVector sotosGroup(NumericVector x, int cutoff) {
     IntegerVector groupVec (x.size());
     int group = 1;
     double runSum = 0;
     for (int i = 0; i < x.size(); i++) {
      runSum += x[i];
      if (runSum > cutoff) {
       group++;
       runSum = x[i];
      }
      groupVec[i] = group;
     }
     return groupVec;
    }
    ')
    
    # use the function as usual
    y_cpp <- sotosGroup(x, 15)
    sapply(split(x, y_cpp), sum)
    #>  1  2  3  4  5  6  7  8  9 10 
    #> 13 13  9 10 15 12 12 13 14  8
    
    
    all.equal(y, y_cpp)
    #> [1] TRUE
    

    In case anyone needs to be convinced by the speed:

    # Speed Benchmarks
    library(data.table)
    library(microbenchmark)
    dt <- data.table(x)
    
    frank <- function(DT, n = 15) {
     DT[, xc := cumsum(x)]
     b = DT[.(shift(xc, fill=0) + n + 1), on=.(xc), roll=-Inf, which=TRUE]
     z = 1; res = z
     while (!is.na(z)) 
      res <- c(res, z <- b[z])
     DT[, g := cumsum(.I %in% res)][]
    }
    
    microbenchmark(
     frank(dt),
     sotosGroup(x, 15),
     times = 100
    )
    #> Unit: microseconds
    #>               expr      min       lq       mean    median       uq       max neval cld
    #>          frank(dt) 1720.589 1831.320 2148.83096 1878.0725 1981.576 13728.830   100   b
    #>  sotosGroup(x, 15)    2.595    3.962    6.47038    7.5035    8.290    11.579   100  a
    
    0 讨论(0)
  • 2020-12-17 22:07

    This works, but can probably be improved:

    x <- c(3L, 4L, 6L, 10L, 3L, 9L, 10L, 7L, 7L, 1L, 3L, 2L, 7L, 4L, 8L, 5L, 8L, 10L, 4L, 8L)
    y <- as.integer(c(1, 1, 1, 2, 2, 3, 4, 5 ,5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10))
    n = 15
    library(data.table)
    DT = data.table(x,y)
    DT[, xc := cumsum(x)]
    b = DT[.(shift(xc, fill=0) + n + 1), on=.(xc), roll=-Inf, which=TRUE]
    z = 1; res = logical(length(x))
    while (!is.na(z) && z <= length(x)){ 
        res[z] <- TRUE 
        z <- b[z]
    }
    DT[, g := cumsum(res)]
         x  y  xc  g
     1:  3  1   3  1
     2:  4  1   7  1
     3:  6  1  13  1
     4: 10  2  23  2
     5:  3  2  26  2
     6:  9  3  35  3
     7: 10  4  45  4
     8:  7  5  52  5
     9:  7  5  59  5
    10:  1  5  60  5
    11:  3  6  63  6
    12:  2  6  65  6
    13:  7  6  72  6
    14:  4  7  76  7
    15:  8  7  84  7
    16:  5  8  89  8
    17:  8  8  97  8
    18: 10  9 107  9
    19:  4  9 111  9
    20:  8 10 119 10
    
    DT[, all(y == g)] # TRUE
    

    How it works

    The rolling join asks "if this is the start of a group, where will the next one start?" Then you can iterate over the result, starting from the first position, to find all the groups.

    The last line DT[, g := cumsum(res)] could also be done as a rolling join (maybe faster?):

    DT[, g := data.table(r = which(res))[, g := .I][.(.I), on=.(r), roll=TRUE, x.g ]]
    
    0 讨论(0)
提交回复
热议问题