How to reduce memory usage within Prado's k-means framework used on big data in R?


Question


I am trying to validate Prado's k-means framework for clustering trading strategies based on a returns correlation matrix, as described in his paper, using R for a large number of strategies, say 1000.

He tries to find the optimal k and the optimal initialization for k-means using two for loops over all possible k's and a number of initializations, i.e. k goes from 2 to N-1, where N is the number of strategies.

The issue is that running k-means that many times, and especially with that many clusters, is memory-exhaustive, and neither my computer nor the m3.medium AWS instances I use can do the job. (Both have 4 GB of RAM, though the AWS instances have fewer background processes consuming RAM.)

So, pretty please, any ideas on how to handle this memory issue? Or at least how to estimate the amount of memory needed as a function of the number of strategies used?

I have tried the biganalytics package and its bigkmeans function, and it was not enough. I am also aware that there are AWS instances with more RAM, but I would like to be sure my code is optimal before switching to one of them. I have also tried limiting the number of clusters, which confirmed that they are the main memory consumer, but I would prefer not to rely on that workaround (nor on combining it with a bigger AWS instance).

The highest number of strategies properly executed on AWS was around 500.

The main part of the code to memory-optimize is as follows:

D <- nrow(dist)
seq.inits <- rep(1:nr.inits,D-2)
seq.centers <- rep(2:(D-1),each = nr.inits)
KM <- mapply(function(x,y){
  set.seed(x+333)
  kmeans(dist, y)
},seq.inits,seq.centers)

Here dist is the correlation-distance matrix of the strategies' returns (square, so the number of columns equals the number of rows, among other properties), and nr.inits is the number of initializations; both are input variables. Afterwards, the best clustering is determined using the silhouette score and possibly re-clustered if needed.

I am aware of the fact that distance matrix is not suitable input for k-means and also I am aware of data mining issues, so please do not address these.

My questions as stated above are:

  1. is it possible to reduce memory usage so that I would be able to run 1000 strategies on an m3.medium AWS instance?

  2. is it possible to at least estimate memory usage based on the number of strategies used? (Assuming I try 2:(N-1) clusters.)

Actually, the second question, preferably after optimizing, is more important to me, as I would like to try a much larger number of strategies than "just" 1000.

Thanks in advance for your answers!


Answer 1:


Not storing all results at the same time is a strategy that applies to many problems, even outside R. Furthermore, I think you're not using kmeans correctly, since it expects your input data, not a cross-distance matrix. Similarly, you don't need to allocate all of seq.centers up front. You mention the silhouette index, which can be computed with cluster::silhouette, so:

library(cluster)
data(ruspini) # sample data included in the cluster package

Since your data doesn't change, you can pre-compute the cross-distance matrix:

dm <- dist(ruspini)

One "iteration" of your desired workflow would be:

km <- kmeans(ruspini, 2) # try 2 clusters
score <- mean(cluster::silhouette(km$cluster, dist = dm)[,3L])

You would like several random starts for the same k clusters:

num_starts <- 2L
scores <- sapply(seq_len(num_starts), function(ignored) {
  km <- kmeans(ruspini, 2)
  mean(cluster::silhouette(km$cluster, dist = dm)[,3L])
})

Note that only the score is saved, without the clustering results. You would also like different values of k:

max_k <- 3L
num_starts <- 2L
scores <- sapply(2L:max_k, function(k) {
  repetitions <- sapply(seq_len(num_starts), function(ignored) {
    km <- kmeans(ruspini, k)
    mean(cluster::silhouette(km$cluster, dist = dm)[,3L])
  })

  max(repetitions)
})

For each value of k, we return only the maximum score across all repetitions (again, saving space by not storing everything).
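To see why this matters, and as a rough, hedged answer to your second question: if you keep every kmeans result for k = 2..(N-1) and every start, the k x N centers matrices alone grow cubically with N. A minimal back-of-envelope sketch, assuming those matrices dominate each kmeans object (my own estimate, not an exact figure; estimate_centers_gb is a hypothetical helper, not part of any package):

# Rough estimate of the memory taken by the centers matrices alone when every
# kmeans result for k = 2..(N-1) and nr.inits starts is kept in memory.
estimate_centers_gb <- function(N, nr.inits) {
  doubles <- nr.inits * sum(2:(N - 1)) * N  # one double per center coordinate
  doubles * 8 / 1024^3                      # 8 bytes per double, converted to GiB
}

estimate_centers_gb(500, 1)   # ~0.5 GiB
estimate_centers_gb(1000, 1)  # ~3.7 GiB
estimate_centers_gb(1000, 10) # ~37 GiB

This counts only the centers matrices, so the real footprint is somewhat higher, but it is already roughly enough to explain why around 500 strategies fit in 4 GB of RAM while 1000 do not, and why keeping only the scores is the key saving.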

To make everything reproducible, use set.seed at the top; calling it once is enough for sequential calculations. You might also want to leverage parallelization, but then you might need more RAM (it's hard to say how much, because there are many factors at play), and you would need to be careful with reproducibility. If you want to try it, the final script could look like this:

library(doParallel)
library(cluster)

data(ruspini)
dm <- dist(ruspini)

max_k <- 3L
num_starts <- 2L

# get random seeds for each execution
RNGkind("L'Ecuyer-CMRG") # RNG kind whose state nextRNGStream() can advance
set.seed(333L)
current_seed <- .Random.seed # initialize
seeds <- lapply(2L:max_k, function(ignored) {
  lapply(seq_len(num_starts), function(also_ignored) {
    seed <- current_seed
    current_seed <<- parallel::nextRNGStream(current_seed)
    # return
    seed
  })
})

workers <- makeCluster(detectCores())
registerDoParallel(workers)

scores <- foreach(k = 2L:max_k, k_seeds = seeds, .combine = c, .packages = "cluster") %dopar% {
  repetitions <- sapply(seq_len(num_starts), function(i) {
    # restore the pre-generated L'Ecuyer-CMRG stream state for this repetition
    assign(".Random.seed", k_seeds[[i]], envir = globalenv())
    km <- kmeans(ruspini, k)
    mean(cluster::silhouette(km$cluster, dist = dm)[,3L])
  })

  max(repetitions)
}

stopCluster(workers); registerDoSEQ(); rm(workers)

names(scores) <- paste0("k_", 2L:max_k)
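As a small follow-up (my addition, not part of the workflow above), one way to use the scores vector afterwards could be to pick the best k and refit once, so that only a single clustering result is ever kept in memory:

best_k <- (2L:max_k)[which.max(scores)] # k with the highest average silhouette width
set.seed(333L)
final_km <- kmeans(ruspini, best_k)     # the only full clustering object we store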


Source: https://stackoverflow.com/questions/55341728/how-to-reduce-memory-usage-within-prados-k-means-framework-used-on-big-data-in
