Question:

I have a huge training data set for a random forest (dimensions: 47600811*9). I want to draw multiple (say 1000) bootstrap samples of dimension 10000*9 (9000 negative-class and 1000 positive-class data points in each run), grow trees for each sample iteratively, and then combine all of those trees into one forest. A rough idea of the code I need is given below. Can anybody tell me how to draw random samples with replacement from my actual trainData and generate the trees for them efficiently? It would be a great help. Thanks.

    library(doSNOW)
    library(randomForest)
    cl <- makeCluster(8)
    registerDoSNOW(cl)

    for (i in 1:1000) {
      B <- 1000
      U <- 9000
      dataB <- trainData[sample(which(trainData$class == "B"), B, replace=TRUE),]
      dataU <- trainData[sample(which(trainData$class == "U"), U, replace=TRUE),]
      subset <- rbind(dataB, dataU)

I am not sure whether this is the best way to produce a subset again and again (1000 times) from the actual trainData.

      rf <- foreach(ntree=rep(125, 8), .packages='randomForest') %dopar% {
        randomForest(subset[,-1], subset$class, ntree=ntree)
      }
    }
    crf <- do.call('combine', rf)
    print(crf)
    stopCluster(cl)

Answer 1:

Although your example parallelizes the inner rather than the outer loop, it may work reasonably well as long as the inner foreach loop takes more than a few seconds to execute, which it almost certainly does. However, your program does have a bug: it is throwing away the first 999 foreach results and only processing the last result. To fix this, you could preallocate a list of length 1000*8 and assign the results from foreach into it on each iteration of the outer for loop. For example:

    library(doSNOW)
    library(randomForest)

    # Toy data: class is a factor in column 1, so that subset[,-1] drops it
    # from the predictors and randomForest runs in classification mode
    trainData <- data.frame(class=factor(c(rep("U", 10), rep("B", 10))),
                            a=rnorm(20), b=rnorm(20))
    n <- 1000         # outer loop count
    chunksize <- 125  # value of ntree used in inner loop
    nw <- 8           # number of cluster workers
    cl <- makeCluster(nw)
    registerDoSNOW(cl)
    rf <- vector('list', n * nw)
    for (i in 1:n) {
      B <- 1000
      U <- 9000
      dataB <- trainData[sample(which(trainData$class == "B"), B, replace=TRUE),]
      dataU <- trainData[sample(which(trainData$class == "U"), U, replace=TRUE),]
      subset <- rbind(dataB, dataU)
      # Slots in rf that belong to this iteration's nw models
      ix <- seq((i-1) * nw + 1, i * nw)
      rf[ix] <- foreach(ntree=rep(chunksize, nw),
                        .packages='randomForest') %dopar% {
        randomForest(subset[,-1], subset$class, ntree=ntree)
      }
    }
    stopCluster(cl)
    cat(sprintf("# models: %d; expected # models: %d\n", length(rf), n * nw))
    cat(sprintf("expected total # trees: %d\n", n * nw * chunksize))
    crf <- do.call('combine', rf)
    print(crf)

This should fix the problem that you mention in the comment that you directed to me.
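
As noted above, the example parallelizes the inner loop. For comparison, here is a minimal sketch of parallelizing the outer loop instead, with one task per bootstrap sample and foreach's .combine argument merging the forests as they come back. The toy data, the small values of n, ntree, nB and nU, and the two-worker cluster are assumptions for illustration and are not taken from the original post. Note that in this layout every worker needs its own copy of trainData, which may be too expensive for a data set of the size described in the question.

```r
library(doSNOW)        # depends on foreach and snow, so both are attached
library(randomForest)

# Toy stand-in for trainData (hypothetical); class is column 1 so [, -1] drops it
trainData <- data.frame(class = factor(rep(c("B", "U"), each = 10)),
                        a = rnorm(20), b = rnorm(20))

n <- 4       # number of bootstrap samples (1000 in the question)
ntree <- 50  # trees grown per sample (illustrative value)
nB <- 5      # class B rows per sample (1000 in the question)
nU <- 15     # class U rows per sample (9000 in the question)

cl <- makeCluster(2)
registerDoSNOW(cl)

# One task per bootstrap sample; combine() merges the returned forests
crf <- foreach(i = seq_len(n), .combine = combine, .multicombine = TRUE,
               .packages = 'randomForest') %dopar% {
  ix <- c(sample(which(trainData$class == "B"), nB, replace = TRUE),
          sample(which(trainData$class == "U"), nU, replace = TRUE))
  randomForest(trainData[ix, -1], trainData$class[ix], ntree = ntree)
}

stopCluster(cl)
print(crf$ntree)  # total number of trees in the combined forest
```

With this layout there is no list to preallocate, because foreach hands the results to combine directly. Whether it beats the inner-loop version likely depends on how costly it is to ship trainData to the workers.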

Answer 2:

Something like this would work:

    # Replicate the expression 1000 times and store each replication's output in a list
    # (simplify = FALSE keeps replicate() from collapsing the result into a matrix).
    # Each replication samples 1000 row indices of class B and 9000 of class U,
    # with replacement, as in the question, and concatenates the two index vectors.
    i = replicate(1000, {
      c(sample(which(trainData$class == "B"), 1000, replace = TRUE),
        sample(which(trainData$class == "U"), 9000, replace = TRUE))
    }, simplify = FALSE)

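Then feed i into a parallel version of lapply, such as mclapply from the parallel package:

    library(parallel)
    library(randomForest)
    # ntree is forwarded to the function; the original gives no value,
    # so 1000 is an assumption (8 * 125 trees per sample in the question)
    rf <- mclapply(i, function(i, ntree) {
      randomForest(trainData[i,-1], trainData[i,]$class, ntree=ntree)
    }, ntree = 1000)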
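
One way to check the index list before fitting anything is to build it on a small toy data set first. The sketch below uses base R only; the toy trainData, the names idx, pos_B and pos_U, and the small sample sizes are assumptions for illustration. It also computes the class positions once, outside replicate, so the which() scan over the full data is not repeated for every sample, which may matter with tens of millions of rows.

```r
# Toy stand-in for trainData (hypothetical): class in column 1, as in the question
set.seed(1)
trainData <- data.frame(class = factor(rep(c("B", "U"), times = c(30, 70))),
                        a = rnorm(100), b = rnorm(100))

n_samples <- 5   # 1000 in the question
n_B <- 10        # 1000 in the question
n_U <- 90        # 9000 in the question

# Compute the class positions once rather than inside every replication
pos_B <- which(trainData$class == "B")
pos_U <- which(trainData$class == "U")

# simplify = FALSE returns a list with one index vector per bootstrap sample
idx <- replicate(n_samples,
                 c(sample(pos_B, n_B, replace = TRUE),
                   sample(pos_U, n_U, replace = TRUE)),
                 simplify = FALSE)

# Class counts per sample: one column per sample, one row per class
counts <- sapply(idx, function(ix) table(trainData$class[ix]))
print(counts)
```

Each column of counts should show n_B rows of class B and n_U rows of class U. The same idx list can then be passed to mclapply in place of i.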
