Creating groups of equal sum in R

Posted by 孤人 on 2020-08-05 06:25:30

Question


I am trying to group a column of my data.frame/data.table into three groups, all with equal sums.

The data is first ordered from smallest to largest, such that group one would be made up of a large number of rows with small values, and group three would have a small number of rows with large values. This is accomplished in spirit with:

test <- data.frame(x = as.numeric(1:100000))
test$y <- NA_real_   # preallocate the group column
store <- 0           # running sum of x
total <- sum(test$x)

for (i in seq_len(nrow(test))) {
  store <- store + test$x[i]
  if (store < total/3) {
    test$y[i] <- 1
  } else if (store < 2*total/3) {
    test$y[i] <- 2
  } else {
    test$y[i] <- 3
  }
}

While successful, I feel like there must be a better way (and maybe a very obvious solution that I am missing).

  • I never like resorting to loops, especially with nested ifs, when a vectorized approach is available; even with 100,000+ records this code becomes quite slow
  • This method would become impossibly complex to code for a larger number of groups (not necessarily the looping, but the ifs)
  • Requires pre-ordering of the column. Might not be able to get around this one.

As a nuance (not that it should make a difference): the data to be summed will rarely, if ever, be consecutive integers.


Answer 1:


Maybe with cumsum:

test$z <- cumsum(test$x) %/% (ceiling(sum(test$x) / 3)) + 1
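The same idea can be generalised to any number of groups. The following is only a sketch: equal_sum_groups is a hypothetical helper name (not from any package), and it assumes the values are non-negative and already sorted. It uses findInterval() on the running sum instead of integer division, which reproduces the thresholds of the loop in the question. It also sidesteps an edge case of the one-liner above: when the total is an exact multiple of 3, the last running sum lands on a fourth label.

```r
# Hypothetical helper: label each row with one of k groups by running sum.
equal_sum_groups <- function(x, k) {
  cs <- cumsum(x)
  total <- cs[length(cs)]
  # interior cut points at 1/k, 2/k, ..., (k-1)/k of the total
  breaks <- total * seq_len(k - 1) / k
  # findInterval() counts how many cut points are <= each running sum
  findInterval(cs, breaks) + 1L
}

test <- data.frame(x = as.numeric(1:100000))
test$g <- equal_sum_groups(test$x, 3)
tapply(test$x, test$g, sum)

# Edge case of the integer-division version when the total divides evenly:
cumsum(1:3) %/% ceiling(sum(1:3) / 3) + 1   # labels 1 2 4
equal_sum_groups(1:3, 3)                    # labels 1 2 3
```

Changing the second argument is all that is needed for more groups, so the nested ifs do not grow with k.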



Answer 2:


This is more or less a bin-packing problem.

Use the binPack function from the BBmisc package:

library(BBmisc)
test$bins <- binPack(test$x, sum(test$x)/3+1)

The sums of the 3 bins are nearly identical:

tapply(test$x, test$bins, sum)


         1          2          3 
1666683334 1666683334 1666683332 
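If installing a package is not an option, a simple greedy heuristic in base R gives a similar result. This is a sketch of a different technique ("largest value first, into the currently lightest group"), not a description of what binPack does internally; greedy_groups is a hypothetical name. Like binPack, it does not keep the small values together in one group.

```r
# Hypothetical helper, not part of BBmisc: greedy largest-first assignment.
greedy_groups <- function(x, k) {
  groups <- integer(length(x))
  sums <- numeric(k)                     # running total of each group
  for (i in order(x, decreasing = TRUE)) {
    j <- which.min(sums)                 # group with the smallest total so far
    groups[i] <- j
    sums[j] <- sums[j] + x[i]
  }
  groups
}

test <- data.frame(x = as.numeric(1:100000))
test$greedy <- greedy_groups(test$x, 3)
tapply(test$x, test$greedy, sum)
```

For non-negative values, the heaviest and lightest groups produced this way differ by no more than the largest single value, because every value is added to whichever group was lightest at that moment.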



Answer 3:


I thought that the cumsum/modulo division approach was very elegant, but it does return a somewhat irregular allocation:

> tapply(test$x, test$z, sum)
         1          2          3 
1666636245 1666684180 1666729575 
> sum(test$x)/3
[1] 1666683333

So I thought I would first create a random permutation and offer something similar:

 test$x <- sample(test$x)
 test$z2 <- cumsum(test$x)[ findInterval(cumsum(test$x), 
                                        c(0, 1666683333*(1:2), sum(test$x)+1))]
> tapply(test$x, test$z2, sum)
     91099     116379     129539 
1666676164 1666686837 1666686999 

This also achieves a more even distribution of counts:

> table(test$z2)

 91099 116379 129539 
 33245  33235  33520 
> table(test$z)

    1     2     3 
57734 23915 18351 

I must admit to puzzlement regarding the naming of the entries in z2.
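The labels appear to come from the outer cumsum(test$x)[ ... ] indexing: findInterval() already returns the interval number (1, 2 or 3), and using that number to index the running sums picks out the first, second or third running sum of the shuffled data. A minimal sketch, with an arbitrary seed and the local names cs and idx introduced only for illustration:

```r
set.seed(1)                       # arbitrary seed, only to make the sketch repeatable
x <- sample(as.numeric(1:100000))
cs <- cumsum(x)

# Interval number of each running sum: always 1, 2 or 3
idx <- findInterval(cs, c(0, sum(x) / 3 * (1:2), sum(x) + 1))
table(idx)

# Indexing cs by idx returns cs[1], cs[2] or cs[3], i.e. the first three
# running sums, which is where the unusual group names come from
z2 <- cs[idx]
unique(z2)
cs[1:3]
```

Using idx directly as the grouping column gives the same allocation with plain 1, 2, 3 labels.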




Answer 4:


You can use fold() from groupdata2 and get an almost equal number of elements per group:

# Create data frame
test <- data.frame(x = as.numeric(1:100000)) 

# Use fold() to create 3 numerically balanced groups  
test <- groupdata2::fold(test, k = 3, num_col = "x")

# Show the first 10 rows
head(test, 10)

## # A tibble: 10 x 2
## # Groups:   .folds [3]
##        x .folds
##    <dbl> <fct> 
##  1     1 1     
##  2     2 3     
##  3     3 2     
##  4     4 1     
##  5     5 2     
##  6     6 2     
##  7     7 1     
##  8     8 3     
##  9     9 2     
## 10    10 3 

# Check the sum and number of elements per group
library(dplyr)  # provides the %>% pipe
test %>% 
  dplyr::group_by(.folds) %>% 
  dplyr::summarize(sum_ = sum(x),
                   n_members = dplyr::n())

## # A tibble: 3 x 3
##   .folds       sum_ n_members
##   <fct>       <dbl>     <int>
## 1 1      1666690952     33333
## 2 2      1666716667     33334
## 3 3      1666642381     33333



Answer 5:


Or you can just cut on the cumsum:

test$z <- cut(cumsum(test$x), breaks = 3, labels = 1:3) 

or use ggplot2::cut_interval instead of cut:

test$z <- ggplot2::cut_interval(cumsum(test$x), n = 3, labels = 1:3) 


Source: https://stackoverflow.com/questions/29424130/creating-groups-of-equal-sum-in-r
