Generating a very large matrix of string combinations using combn() and bigmemory package

让人想犯罪 __ posted on 2019-11-29 02:26:37

You could first find all 2-way combinations and then combine them with the third value, writing each batch to a file as you go. This takes a lot less memory:

combn.mod <- function(x, fname) {
  # All 2-element combinations, kept as a list of pairs
  tmp <- combn(x, 2, simplify = FALSE)
  n <- length(x)
  # The last two values never need to act as the added third element
  for (i in x[-c(n, n - 1)]) {
    # Drop all combinations that contain value i
    id <- which(!unlist(lapply(tmp, function(p) i %in% p)))
    tmp <- tmp[id]
    # Add i to all other combinations and write to file
    out <- do.call(rbind, lapply(tmp, c, i))
    write(t(out), file = fname, ncolumns = 3, append = TRUE, sep = ",")
  }
}

combn.mod(x,"F:/Tmp/Test.txt")

This is not as general as Joshua's answer; it is written specifically for your case. I would guess it is faster (again, only for this particular case), but I didn't make the comparison. The function runs on my computer using a little over 50 MB (roughly estimated) when applied to your x.

EDIT

On a side note: if this is for simulation purposes, I find it hard to believe that any scientific application needs 400+ million simulation runs. You might be looking for the correct answer to the wrong question here...

PROOF OF CONCEPT :

I replaced the write line with tt[[i]] <- out, added tt <- list() before the loop and return(tt) after it. Then:

> do.call(rbind,combn.mod(letters[1:5]))
      [,1] [,2] [,3]
 [1,] "b"  "c"  "a" 
 [2,] "b"  "d"  "a" 
 [3,] "b"  "e"  "a" 
 [4,] "c"  "d"  "a" 
 [5,] "c"  "e"  "a" 
 [6,] "d"  "e"  "a" 
 [7,] "c"  "d"  "b" 
 [8,] "c"  "e"  "b" 
 [9,] "d"  "e"  "b" 
[10,] "d"  "e"  "c" 
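
For reference, the modified function described above might look like the following sketch. The name combn.mod.list is made up for this illustration; the body is the combn.mod code with the three changes applied (a list created before the loop, the write call replaced by an assignment, and the list returned at the end).

```r
# Hypothetical in-memory variant of combn.mod, useful only for small inputs
combn.mod.list <- function(x) {
  tmp <- combn(x, 2, simplify = FALSE)   # all pairs, as a list
  n <- length(x)
  tt <- list()
  for (i in x[-c(n, n - 1)]) {
    # Keep only the pairs that do not contain value i
    keep <- !vapply(tmp, function(p) i %in% p, logical(1))
    tmp <- tmp[keep]
    # Store the triples for this i instead of writing them to a file
    tt[[i]] <- do.call(rbind, lapply(tmp, c, i))
  }
  tt
}
```

Calling do.call(rbind, combn.mod.list(letters[1:5])) should give choose(5, 3) = 10 rows, each a distinct triple.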

Here's a function I've written in R, which currently finds its (unexported) home in the LSPM package. You give it the total number of items n, the number of items to select r, and the index of the combination you want i; it returns the values in 1:n corresponding to combination i.

".combinadic" <- function(n, r, i) {

  # http://msdn.microsoft.com/en-us/library/aa289166(VS.71).aspx
  # http://en.wikipedia.org/wiki/Combinadic

  if(i < 1 || i > choose(n,r)) stop("'i' must be 0 < i <= n!/(r!(n-r)!)")

  largestV <- function(n, r, i) {
    #v <- n-1
    v <- n                                  # Adjusted for one-based indexing
    #while(choose(v,r) > i) v <- v-1
    while(choose(v,r) >= i) v <- v-1        # Adjusted for one-based indexing
    return(v)
  }

  res <- rep(NA,r)
  for(j in 1:r) {
    res[j] <- largestV(n,r,i)
    i <- i-choose(res[j],r)
    n <- res[j]
    r <- r-1
  }
  res <- res + 1
  return(res)
}

It allows you to generate each combination based on the value of the lexicographic index:

> .combinadic(1344, 3, 1)
[1] 3 2 1
> .combinadic(1344, 3, 2)
[1] 4 2 1
> .combinadic(1344, 3, 403716544)
[1] 1344 1343 1342

So you just need to loop over 1:403716544 and append the results to a file. It may take a while, but it's at least feasible (see Dirk's answer). You may also need to do it in several loops, since the vector 1:403716544 will not fit in memory on my machine.

Or you could just port the R code to C/C++ and do the looping / writing there, since it would be a lot faster.
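
A minimal sketch of the chunked loop described above, staying in R. The names combinadic (a compact copy of .combinadic so the example runs on its own), write_combinations and chunk_size are assumptions made for this illustration, not part of LSPM. Only one chunk of indices is held in memory at a time.

```r
# Compact copy of the .combinadic logic, so the sketch is self-contained
combinadic <- function(n, r, i) {
  if (i < 1 || i > choose(n, r)) stop("'i' must satisfy 1 <= i <= choose(n, r)")
  res <- numeric(r)
  for (j in seq_len(r)) {
    v <- n
    while (choose(v, r) >= i) v <- v - 1   # largest v with choose(v, r) < i
    res[j] <- v
    i <- i - choose(v, r)
    n <- v
    r <- r - 1
  }
  res + 1
}

# Hypothetical helper: append all combinations to 'file', chunk by chunk
write_combinations <- function(n, r, file, chunk_size = 100000) {
  total <- choose(n, r)
  start <- 1
  while (start <= total) {
    end <- min(start + chunk_size - 1, total)
    # r x k integer matrix with one combination per column
    block <- vapply(seq(start, end),
                    function(i) as.integer(combinadic(n, r, i)),
                    integer(r))
    # write() consumes the data column by column, so each combination is one line
    write(block, file = file, ncolumns = r, append = TRUE, sep = ",")
    start <- end + 1
  }
  invisible(total)
}
```

For the case in the question this would be called as write_combinations(1344, 3, "some_file.txt"); expect it to be slow, since every index is decoded from scratch.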

At a first approximation, every algorithm trades off storage for speed.

You have hit a boundary trying to preallocate your fully enumerated combination matrix. So maybe you should not preallocate this matrix at all and instead try, say,

  1. If you think you need the combinations, calculate them somewhere else and store them in a simple DB (or, heck, a flat file) and look them up -- 9 GB saved

  2. Take advantage of open source, read the code to combn() and modify it into a client-server thingy: given a call with index number N, it will loop and return the Nth entry. Not efficient, but possibly more easily feasible.
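
As a rough illustration of option 2, here is a sketch of a function that steps through the combinations in the same order combn() produces them and returns only the Nth one, without ever storing the full matrix. It is not taken from the combn() source; the name nth_combn and its arguments are assumptions made for this example. As noted above, it is not efficient: the cost grows linearly with N.

```r
# Hypothetical helper: return the Nth m-element combination of 1:n,
# in the same (lexicographic) order as the columns of combn(n, m)
nth_combn <- function(n, m, N) {
  stopifnot(N >= 1, N <= choose(n, m))
  idx <- seq_len(m)                 # first combination: 1, 2, ..., m
  k <- 1
  while (k < N) {
    # Find the rightmost position that has not reached its maximum value
    j <- m
    while (idx[j] == n - m + j) j <- j - 1
    # Increment it and reset everything to its right
    idx[j] <- idx[j] + 1
    if (j < m) idx[(j + 1):m] <- idx[j] + seq_len(m - j)
    k <- k + 1
  }
  idx
}
```

To map the result back to the original values, index into the vector: x[nth_combn(length(x), 3, N)].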
