Conditional cumsum with reset

暗喜 2020-12-03 01:54

I have a data frame that is already sorted as needed, and now I would like to "slice it" into groups.

Each group should have a cumulative value of at most 10.

4 answers
  •  清歌不尽
    2020-12-03 02:42

    The function below uses recursion to construct a vector with the lengths of each group. It is faster than a loop for small data vectors (length less than about a hundred values), but slower for longer ones. It takes three arguments:

    1) vec: A vector of values that we want to group.

    2) i: The index of the starting position in vec.

    3) glv: A vector of group lengths. This is the return value, but we need to initialize it and pass it along through each recursion.

    # Group a vector based on consecutive values with a cumulative sum <= 10
    gf = function(vec, i, glv) {
    
      ## Break out of the recursion when we get to the last group
      if (sum(vec[i:length(vec)]) <= 10) {
        glv = c(glv, length(i:length(vec)))
        return(glv)
      }
    
      ## Keep recursion going if there are at least two groups left
      # Calculate length of current group
      gl = sum(cumsum(vec[i:length(vec)]) <= 10)
    
      # Append to previous group lengths
      glv.append = c(glv, gl)
    
      # Call function recursively 
      gf(vec, i + gl, glv.append)
    }
    

    Run the function to return a vector of group lengths:

    group_vec = gf(df$value, 1, numeric(0))
    group_vec
    # [1] 2 2 2 3 2 3 1
    

    To add a column to df that gives each row's group number, repeat each group index by its group length with rep:

    df$group10 = rep(1:length(group_vec), group_vec)
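
    As a small illustration of the rep step, here is a sketch on made-up data. The values in df and the group lengths in group_vec are hypothetical, chosen by hand so that each group sums to at most 10; they are not the asker's data. seq_along(group_vec) is used in place of 1:length(group_vec), which gives the same result for a non-empty vector.

    ```r
    # Hypothetical data: seven values, already in the desired order
    df <- data.frame(value = c(4, 6, 3, 7, 2, 5, 3))

    # Hypothetical group lengths, as gf() would return them for these values
    group_vec <- c(2, 2, 3)

    # Repeat each group index by its group length: 1 1 2 2 3 3 3
    df$group10 <- rep(seq_along(group_vec), group_vec)

    # Check the per-group totals; each one should be <= 10
    group_sums <- tapply(df$value, df$group10, sum)
    print(df)
    print(group_sums)
    ```

    The group lengths must add up to nrow(df), otherwise the assignment to df$group10 fails.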
    

    In its current form the function only works on vectors that don't have any values greater than 10 (such a value gives a group length of 0, so the recursion never advances), and the grouping by sums <= 10 is hard-coded. The function can of course be generalized to deal with these limitations.
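
    One possible generalization is sketched below. It is not the function above: it swaps the recursion for a while loop, takes the threshold as a parameter, and lets a single value above the threshold form a group of its own. The name group_lengths, the limit argument, and that oversized-value rule are all assumptions made for illustration. Like the original, it assumes the values are non-negative, so that the running sum never decreases.

    ```r
    # Hypothetical generalization: group lengths for any threshold `limit`.
    # Assumes non-negative values, so the running sum never decreases.
    group_lengths <- function(vec, limit = 10) {
      lens <- integer(0)
      n <- length(vec)
      i <- 1
      while (i <= n) {
        # Count the leading values whose running sum stays within the limit
        gl <- sum(cumsum(vec[i:n]) <= limit)
        # Assumption: a single value above the limit becomes its own group,
        # which also guarantees that i always advances
        gl <- max(gl, 1L)
        lens <- c(lens, gl)
        i <- i + gl
      }
      lens
    }

    # Made-up example: 12 exceeds the limit and ends up alone in group 3
    x <- c(4, 5, 3, 7, 12, 2, 8, 1)
    lens <- group_lengths(x, limit = 10)
    print(lens)                          # 2 2 1 2 1
    print(rep(seq_along(lens), lens))    # 1 1 2 2 3 4 4 5
    ```

    Whether an oversized value should get its own group or raise an error depends on the use case; the sketch picks the former only to keep the loop from stalling.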

    The function can be sped up somewhat by computing cumulative sums that look ahead only a fixed number of values, rather than over the whole remaining vector. For example, if every value is at least 1, you only need to look ten values ahead, since you'll never need to sum more than ten such numbers to reach a value of 10. This too can be generalized for any target value. Even with this modification, the function is still slower than a loop for a vector with more than about a hundred values.
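
    The look-ahead idea could be sketched roughly as follows. This is an illustrative loop-based variant, not the recursive function above, and the name group_lengths_window and its window argument are hypothetical. It assumes every value is at least 1, so a group can never hold more than floor(limit) values and the cumulative sum only needs to cover that many elements.

    ```r
    # Hypothetical look-ahead variant: only the next `window` values are summed.
    # Assumes every value is >= 1, so a group holds at most floor(limit) values.
    group_lengths_window <- function(vec, limit = 10, window = floor(limit)) {
      lens <- integer(0)
      n <- length(vec)
      i <- 1
      while (i <= n) {
        # Restrict the cumulative sum to a short window instead of vec[i:n]
        last <- min(i + window - 1, n)
        gl <- sum(cumsum(vec[i:last]) <= limit)
        # Assumption: a single value above the limit becomes its own group
        gl <- max(gl, 1L)
        lens <- c(lens, gl)
        i <- i + gl
      }
      lens
    }

    # Made-up example: a run of eleven 1s forces one full-width group of ten
    x <- c(4, 5, 3, 7, rep(1, 11), 9)
    lens_w <- group_lengths_window(x, limit = 10)
    print(lens_w)   # 2 2 10 2
    ```

    If the values can be smaller than 1, the window has to be widened accordingly, for example to floor(limit / min(vec)) for strictly positive values.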

    I haven't worked with recursive functions in R before and would be interested in any comments and suggestions on whether recursion makes sense for this type of problem and whether it can be improved, especially execution speed.
