How can I parallelize combn()?

The function combn() generates all combinations of the elements of x taken m at a time. It is very fast and efficient for nCm small (where n is the number of elements of x) but it quickly runs out of memory. For example:

> combn(c(1:50), 12, simplify = TRUE)
Error in matrix(r, nrow = len.r, ncol = count) : 
invalid 'ncol' value (too large or NA)

I would like to know if the function combn() can be modified such that it generates only k chosen combinations. Let's call this new function chosencombn(). Then we would have:

> combn(c("a", "b", "c", "d"), m=2)
     [,1] [,2] [,3] [,4] [,5] [,6]
 [1,] "a"  "a"  "a"  "b"  "b"  "c" 
 [2,] "b"  "c"  "d"  "c"  "d"  "d" 

>chosencombn(c("a", "b", "c", "d"), m=2, i=c(1,4,6))
     [,1] [,2] [,3]
 [1,] "a"  "b"  "c" 
 [2,] "b"  "c"  "d"

>chosencombn(c("a", "b", "c", "d"), m=2, i=c(4,5))
     [,1] [,2]
 [1,] "b"  "b" 
 [2,] "c"  "d"

I understand that such a function would require the use of an ordering of the combinations such that one can find the place of a given combination immediately. Does such an ordering exist? Can it be coded to obtain a function as efficient as combn()?

Package "trotter" is useful for this as it does not keep the permutations in memory.

library(trotter)

combs = cpv(2, c("a", "b", "c", "d"))
sapply(c(1, 4, 6), function(i) combs[i])
#     [,1] [,2] [,3]
#[1,] "a"  "b"  "c" 
#[2,] "b"  "c"  "d"

To get a sense of how combn orders its output, let's look at the output of combn(1:5, 3):

combn(1:5, 3)
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,]    1    1    1    1    1    1    2    2    2     3
# [2,]    2    2    2    3    3    4    3    3    4     4
# [3,]    3    4    5    4    5    5    4    5    5     5

There is a lot of structure here. First, all columns are ordered as you go downward, and the first row is non-decreasing. The columns starting with 1 have combn(2:5, 2) below them; the columns starting with 2 have combn(3:5, 2) below them; and so on.

Let's now think of how to construct column number 8. The approach I would take to reconstruct would be to determine the first element of that column (due to the relationship above there are choose(4, 2)=6 columns starting with 1, choose(3, 2)=3 columns starting with 2, and choose(2, 2)=1 column starting with 3). In our case we determine that we start with 2, since columns 7-9 must start with 2.

To determine the second and subsequent elements of the column, we repeat the process with a smaller number of elements (since 2 is our first element, we're now selecting from elements 3-5), a new position (we're selecting column number 8-6=2 that begins with a 2), and a new number of remaining elements to select (we need 3-1=2 more elements).

getcombn below is an iterative formulation that does just this:

getcombn <- function(x, m, pos) {
  combo <- rep(NA, m)
  start <- 1
  for (i in seq_len(m-1)) {
    end.pos <- cumsum(choose((length(x)-start):(m-i), m-i))
    selection <- which.max(end.pos >= pos)
    start <- start + selection
    combo[i] <- x[start - 1]
    pos <- pos - c(0, end.pos)[selection]
  }
  combo[m] <- x[start + pos - 1]
  combo
}

chosencombn <- function(x, m, all.pos) {
  sapply(all.pos, function(pos) getcombn(x, m, pos))
}
chosencombn(c("a", "b", "c", "d"), 2, c(1,4,6))
#     [,1] [,2] [,3]
# [1,] "a"  "b"  "c" 
# [2,] "b"  "c"  "d" 
chosencombn(c("a", "b", "c", "d"), 2, c(4,5))
#     [,1] [,2]
# [1,] "b"  "b" 
# [2,] "c"  "d"

This enables you to compute particular columns in cases where it would be impossible to enumerate all the combinations (you would run out of memory). For instance, with 50 options, the number of ways to select 25 elements is a 14-digit number, so enumerating all combinations is probably not an option. Still, you can compute specific indicated combinations:

chosencombn(1:50, 25, c(1, 1000000L, 1e14))
#       [,1] [,2] [,3]
#  [1,]    1    1    3
#  [2,]    2    2    4
#  [3,]    3    3    6
#  [4,]    4    4    7
#  [5,]    5    5    8
#  [6,]    6    6   11
#  [7,]    7    7   14
#  [8,]    8    8   15
#  [9,]    9    9   17
# [10,]   10   10   20
# [11,]   11   11   22
# [12,]   12   12   25
# [13,]   13   13   27
# [14,]   14   14   30
# [15,]   15   15   31
# [16,]   16   16   32
# [17,]   17   17   33
# [18,]   18   18   36
# [19,]   19   20   37
# [20,]   20   23   39
# [21,]   21   27   40
# [22,]   22   39   42
# [23,]   23   42   47
# [24,]   24   45   48
# [25,]   25   49   50

来源：https://stackoverflow.com/questions/35301986/how-can-i-parallelize-combn

标签

parallel-processing

combinations

combinatorics