Reducing NbClust memory usage


Question


I need some help with the massive memory usage of the NbClust function. On my data, memory usage balloons to 56 GB, at which point R crashes with a fatal error. Using debug(), I was able to trace the error to these lines:

            if (any(indice == 23) || (indice == 32)) {
                res[nc - min_nc + 1, 23] <- Index.sPlussMoins(cl1 = cl1, 
                    md = md)$gamma

Debugging Index.sPlussMoins revealed that the crash happens during a for loop. The iteration at which it crashes varies, and during the loop memory usage fluctuates between 41 and 57 GB (I have 64 GB in total):

    for (k in 1:nwithin1) {
      s.plus <- s.plus + (colSums(outer(between.dist1, 
                                        within.dist1[k], ">")))
      s.moins <- s.moins + (colSums(outer(between.dist1, 
                                          within.dist1[k], "<")))
      print(s.moins)
    }

I'm guessing that the memory usage comes from the outer() function. Can I modify NbClust to be more memory efficient (perhaps using the bigmemory package)? At the very least, it would be nice to get R to exit the function with a "cannot allocate vector of size..." error instead of crashing. That way I would have an idea of just how much more memory I need to handle the matrix causing the crash.
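A sketch of what I have in mind (my own idea, untested against NbClust itself): the totals that loop accumulates (how many between-cluster distances are greater or smaller than each within-cluster distance) can be computed without materializing any outer() matrix, by sorting the between distances once and counting with findInterval(). This assumes between.dist1 and within.dist1 are plain numeric vectors, as they are inside Index.sPlussMoins:

    # Lower-memory replacement for the loop above (my own sketch, untested
    # against NbClust). Sort the between-cluster distances once, then count
    # with findInterval() instead of building a logical matrix per iteration.
    b  <- sort(between.dist1)
    nb <- length(b)

    # For each within-cluster distance w:
    #   findInterval(w, b)                   counts between distances <= w
    #   findInterval(w, b, left.open = TRUE) counts between distances <  w
    n_le <- findInterval(within.dist1, b)
    n_lt <- findInterval(within.dist1, b, left.open = TRUE)

    s.plus  <- sum(as.numeric(nb - n_le))   # pairs with between > within
    s.moins <- sum(as.numeric(n_lt))        # pairs with between < within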

Edit: I created a minimal example with a matrix the approximate size of the one I am using, although now it crashes at a different point, when the hclust function is called:

set.seed(123)

cluster_means = sample(1:25, 10)
mlist = list()
for(cm in cluster_means){
  name = as.character(cm)
  m = data.frame(matrix(rnorm(60000*60,mean=cm,sd=runif(1, 0.5, 3.5)), 60000, 60))
  mlist[[name]] = m
}

test_data = do.call(cbind, cbind(mlist))

library(NbClust)
debug(fun = "NbClust")
nbc = NbClust(data = test_data, diss = NULL, distance = "euclidean", min.nc = 2, max.nc = 30, 
              method = "ward.D2", index = "alllong", alphaBeale = 0.1)
debug: hc <- hclust(md, method = "ward.D2")

It seems to crash before using up the available memory (according to my system monitor, 34 GB of the 64 GB total is in use when it crashes).
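Related to my wish above for a clean "cannot allocate" error: one workaround I am considering (assuming a Linux machine and the CRAN unix package, which has nothing to do with NbClust) is to cap the R process's address space so that oversized allocations fail with an error instead of taking the whole session down:

    # My own workaround sketch, assuming Linux and the CRAN 'unix' package:
    # cap the address space so a failed allocation raises
    # "cannot allocate vector of size ..." rather than crashing the session.
    library(unix)
    rlimit_as(cur = 55e9)   # limit this R process to roughly 55 GB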

So is there any way I can do this without sub-sampling down to manageably sized matrices? And if I did, how would I know how much memory I need for a matrix of a given size? I would have thought my 64 GB would be enough.
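For scale, a rough back-of-envelope (standard facts about stats::dist, nothing specific to NbClust): a dist object holds n*(n-1)/2 doubles, and hclust and the NbClust indices need working copies on top of that, so the test matrix above is already pushing the limit:

    # Back-of-envelope: size of the distance object alone for n observations.
    # stats::dist stores n*(n-1)/2 doubles at 8 bytes each; hclust and the
    # NbClust index computations need additional working copies on top.
    n <- 60000                 # rows in the dummy test_data above
    pairs <- n * (n - 1) / 2
    pairs * 8 / 2^30           # about 13.4 GiB just for the lower triangle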

Edit: I tried altering NbClust to use fastcluster instead of the stats version of hclust. It didn't crash, but it did exit with a memory error:

Browse[2]> 
exiting from: fastcluster::hclust(md, method = "ward.D2")
Error: cannot allocate vector of size 9.3 Gb

Answer 1:


If you check the source code of NbClust, you'll see that it is anything but optimized for speed or memory efficiency.

The crash you're reporting is not even during clustering; it happens in the evaluation afterwards, specifically in the "Gamma, Gplus and Tau" index code. Disable these indices and you may get further, but most likely you'll just hit the same problem again in another index. Maybe you can pick only a few indices to run, specifically indices that do not need a lot of memory?
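For example, something along these lines might be workable (my sketch of the standard NbClust interface; index takes a single name per call). Note that the n*(n-1)/2 distance vector is still needed for method = "ward.D2", so the distances themselves can remain the bottleneck:

    # Sketch: run a few less memory-hungry indices one at a time instead of
    # index = "alllong" (which includes Gamma, Gplus and Tau). Index names are
    # from the NbClust documentation; the distance computation itself is still
    # required for a hierarchical method such as ward.D2.
    library(NbClust)
    for (ix in c("ch", "db", "silhouette")) {
      res <- NbClust(data = test_data, distance = "euclidean",
                     min.nc = 2, max.nc = 30, method = "ward.D2", index = ix)
      print(res$Best.nc)
    }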




Answer 2:


I forked NbClust and made some changes that seem to let it run longer without crashing on bigger matrices. I changed some of the functions to use Rfast, propagate, and fastcluster. However, there are still problems.

I haven't run all my data yet, and have only run a few tests on dummy data with the gap index, so there is still time for it to fail. But any suggestions/criticisms would be welcome. My (in-progress) fork of NbClust: https://github.com/jbhanks/NbClust
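To give an idea of the kind of substitution involved (my own illustration, not the actual diff in the fork):

    # Illustration only (my own example, not the actual diff in the fork):
    # swap stats::dist / stats::hclust for Rfast and fastcluster equivalents.
    # Rfast::Dist returns a full n x n matrix, so memory is still a concern
    # for very large n.
    library(Rfast)        # fast distance computations
    library(fastcluster)  # lower-overhead drop-in for hclust

    md <- Rfast::Dist(as.matrix(test_data), method = "euclidean")
    hc <- fastcluster::hclust(as.dist(md), method = "ward.D2")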



Source: https://stackoverflow.com/questions/57683324/reducing-nbclust-memory-usage
