Bin columns and aggregate data via random sample with replacement for iteratively larger bin sizes

Question

Below is an example matrix:

mat<- matrix(c(1,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0,
   2,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,
   0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,
   0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,
   0,0,0,0,1,0,0,1,0,1,1,0,0,1,0,1,
   1,1,0,0,0,0,0,0,1,0,1,2,1,0,0,0), nrow=16, ncol=6)
dimnames(mat)<- list(c("a", "c", "f", "h", "i", "j", "l", "m", "p", "q", "s", "t", "u", "v","x", "z"), 
          c("1", "2", "3", "4", "5", "6"))

I want to group (bin) columns and then aggregate the data in each bin. First, I would like to bin two columns at a time. Binned columns must be adjacent to each other (i.e., columns 1 & 2 or columns 5 & 6, NOT columns 4 & 6). Where the binning starts in the matrix is random. For example, in a matrix of 600 columns, the first two columns binned may be columns 545 & 546 and the next columns 3 & 4. I would like to sample without replacement so that no combination is sampled twice. Aggregation means calculating the row sums for the bin with rowSums(). Each aggregated result becomes a new column in a result matrix, so the number of columns in the result matrix equals the number of bins randomly sampled.

The bin size then keeps growing. Next, it increases to 3, so that 3 adjacent columns of data are aggregated, and the aggregated data go into a different result matrix. This process continues until the bin is as wide as the whole matrix. All result matrices are collected in a list of matrices.

I have posted a similar question for an alternative binning technique here: Moving window method to aggregate data

I have tried modifying the code so that the binning technique randomly samples n adjacent columns and calculates row sums:

lapply(seq_len(ncol(mat) - 1), function(j) do.call(cbind, 
lapply(sample(ncol(mat)-j, replace = FALSE, size = length(x)), function(i) rowSums(mat[, i:(i + j)]))))

I need help modifying this line of code to randomly sample, without replacement, n bins of i adjacent columns each and to aggregate each bin using row sums. Note that a given combination of columns cannot be resampled, but individual columns can be reused if they are part of a new combination.

Answer 1:

Here's a parameterized approach that samples from the possible combinations without replacement, calculates the summary from the original data, and labels the result columns so you can see where they came from (and have confidence there are no repeats).

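set.seed(47)
n_cols_in_bin = 2  # bin width: number of adjacent columns per bin
n_samps = 4        # number of bins to draw

# Each bin is identified by its first column; sample() draws without
# replacement by default, so no bin is picked twice.
starting_cols = sample(1:(ncol(mat) -  (n_cols_in_bin - 1)), size = n_samps) 
# Row sums over each bin become one column of the result.
result = sapply(starting_cols, function(x) rowSums(mat[, x:(x + n_cols_in_bin - 1)]))
# Label each column with the range of source columns it aggregates.
colnames(result) = paste0("cols", starting_cols, "to", starting_cols + n_cols_in_bin - 1)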
result
#   cols5to6 cols2to3 cols3to4 cols4to5
# a        1        2        0        0
# c        1        0        1        1
# f        0        1        1        0
# h        0        1        1        0
# i        1        2        1        1
# j        0        0        1        1
# l        0        0        0        0
# m        1        0        0        1
# p        1        0        0        0
# q        1        0        0        1
# s        2        0        0        1
# t        2        0        0        0
# u        1        0        0        0
# v        1        0        0        1
# x        0        1        0        0
# z        1        0        0        1
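
To see why this cannot repeat a bin, here is a minimal sketch on a toy matrix (the names m, k, possible_starts, starts and too_many are made up for illustration and are not part of the code above). A bin of k adjacent columns is fully determined by its starting column, so there are ncol(m) - k + 1 possible bins, and sampling the starting columns without replacement is the same as sampling the bins without replacement.

```r
# Toy matrix standing in for `mat` (assumption: any numeric matrix works).
m <- matrix(1:24, nrow = 4, ncol = 6)
k <- 2  # bin size

# One possible bin per starting column: ncol(m) - k + 1 of them.
possible_starts <- seq_len(ncol(m) - k + 1)

set.seed(1)
# sample() defaults to replace = FALSE, so no start (hence no bin) repeats.
starts <- sample(possible_starts, size = 4)
stopifnot(anyDuplicated(starts) == 0)

# Asking for more bins than exist is an error rather than a silent repeat.
too_many <- tryCatch(sample(possible_starts, size = 6),
                     error = function(e) "error")
```

Note that overlapping bins (for example columns 2 to 3 and columns 3 to 4) are still allowed, which matches the requirement that columns may be reused in new combinations.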

For convenience, we can put it in a function:

foo = function(mat, n_cols_in_bin, n_samps) {
  # Draw n_samps distinct starting columns (without replacement).
  starting_cols = sample(1:(ncol(mat) -  (n_cols_in_bin - 1)), size = n_samps)
  # drop = FALSE keeps the subset a matrix, so a bin of width 1 also works.
  result = sapply(starting_cols, function(x)
    rowSums(mat[, x:(x + n_cols_in_bin - 1), drop = FALSE]))
  colnames(result) = paste0("cols", starting_cols, "to", starting_cols + n_cols_in_bin - 1)
  result
}

foo(mat, n_cols_in_bin = 3, n_samps = 2)
#   cols3to5 cols4to6
# a        0        1
# c        1        2
# f        1        0
# h        1        0
# i        2        1
# j        1        1
# l        0        0
# m        1        1
# p        0        1
# q        1        1
# s        1        2
# t        0        2
# u        0        1
# v        1        1
# x        0        0
# z        1        1
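
To get the list of matrices for increasingly large bins that the question asks for, one option is to call the function once per bin size. The following is only a sketch under some assumptions: bin_all_sizes is a hypothetical wrapper name, bin sizes run from 2 up to ncol(mat), and the number of samples is capped at the number of bins that exist for each size (otherwise sample() would fail for the larger sizes). The matrix and foo() are repeated so the block runs on its own.

```r
# Example matrix from the question.
mat <- matrix(c(1,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0,
                2,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,
                0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,
                0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,
                0,0,0,0,1,0,0,1,0,1,1,0,0,1,0,1,
                1,1,0,0,0,0,0,0,1,0,1,2,1,0,0,0), nrow = 16, ncol = 6)
dimnames(mat) <- list(c("a", "c", "f", "h", "i", "j", "l", "m",
                        "p", "q", "s", "t", "u", "v", "x", "z"),
                      c("1", "2", "3", "4", "5", "6"))

# Same logic as foo() above, restated here for self-containment.
foo <- function(mat, n_cols_in_bin, n_samps) {
  starting_cols <- sample(1:(ncol(mat) - (n_cols_in_bin - 1)), size = n_samps)
  result <- sapply(starting_cols, function(x)
    rowSums(mat[, x:(x + n_cols_in_bin - 1), drop = FALSE]))
  colnames(result) <- paste0("cols", starting_cols, "to",
                             starting_cols + n_cols_in_bin - 1)
  result
}

# Hypothetical wrapper: one result matrix per bin size, collected in a list.
bin_all_sizes <- function(mat, n_samps, bin_sizes = 2:ncol(mat)) {
  out <- lapply(bin_sizes, function(k) {
    n_bins <- ncol(mat) - k + 1          # bins available at this size
    foo(mat, n_cols_in_bin = k, n_samps = min(n_samps, n_bins))
  })
  names(out) <- paste0("bin", bin_sizes)
  out
}

set.seed(47)
res <- bin_all_sizes(mat, n_samps = 4)
sapply(res, ncol)
# bin2 bin3 bin4 bin5 bin6
#    4    4    3    2    1
```

The column counts shrink because a 6-column matrix has only 5, 4, 3, 2 and 1 possible bins of sizes 2 through 6. If you would rather get an error than a silently smaller sample, drop the min() cap.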

Source: https://stackoverflow.com/questions/58083701/bin-columns-and-aggregate-data-via-random-sample-with-replacement-for-iterativel

Tags

loops

aggregate

lapply