Caret - Setting the seeds inside the gafsControl()

Posted by 一世执手 on 2019-12-08 09:45:39

Question


I am trying to set the seeds inside caret's gafsControl(), but I am getting this error:

Error in { : task 1 failed - "supplied seed is not a valid integer"

I understand that seeds for trainControl() is a list whose length equals the number of resamples plus one, where each resample entry holds as many integers as there are combinations of the model's tuning parameters (in my case 36: an SVM with 6 sigma and 6 cost values). However, I couldn't figure out what I should use for gafsControl(). I've tried iters*popSize (100*10), iters (100), and popSize (10), but none of them worked.

Thanks in advance.

Here is my code (with simulated data):

library(caret)
library(doMC)
library(kernlab)

registerDoMC(cores=32)

set.seed(1234)
train.set <- twoClassSim(300, noiseVars = 100, corrVar = 100, corrValue = 0.75)

mylogGA <- caretGA
mylogGA$fitness_extern <- mnLogLoss

#Index for gafsControl
set.seed(1045481)
ga_index <- createFolds(train.set$Class, k=3)

#Seed for the gafsControl()
set.seed(1056)
ga_seeds <- vector(mode = "list", length = 4)
for(i in 1:3) ga_seeds[[i]] <- sample.int(1500, 1000)

## For the last model:
ga_seeds[[4]] <- sample.int(1000, 1)

#Index for the trainControl()
set.seed(1045481)
tr_index <- createFolds(train.set$Class, k=5)

#Seeds for the trainControl()
set.seed(1056)
tr_seeds <- vector(mode = "list", length = 6)
for(i in 1:5) tr_seeds[[i]] <- sample.int(1000, 36)

## For the last model:
tr_seeds[[6]] <- sample.int(1000, 1)


gaCtrl <- gafsControl(functions = mylogGA,
                      method = "cv",
                      number = 3,
                      metric = c(internal = "logLoss",
                                 external = "logLoss"),
                      verbose = TRUE,
                      maximize = c(internal = FALSE,
                                   external = FALSE),
                      index = ga_index,
                      seeds = ga_seeds,
                      allowParallel = TRUE)

tCtrl = trainControl(method = "cv", 
                     number = 5,
                     classProbs = TRUE,
                     summaryFunction = mnLogLoss,
                     index = tr_index,
                     seeds = tr_seeds,
                     allowParallel = FALSE)


svmGrid <- expand.grid(sigma= 2^c(-25, -20, -15,-10, -5, 0), C= 2^c(0:5))

t1 <- Sys.time()
set.seed(1234235)
svmFuser.gafs <- gafs(x = train.set[, names(train.set) != "Class"],
                      y = train.set$Class,
                      gafsControl = gaCtrl,
                      trControl = tCtrl,
                      popSize = 10,
                      iters = 100,
                      method = "svmRadial",
                      preProc = c("center", "scale"),
                      tuneGrid = svmGrid,
                      metric="logLoss",
                      maximize = FALSE)

t2<- Sys.time()
svmFuser.gafs.time<-difftime(t2,t1)

save(svmFuser.gafs, file ="svmFuser.gafs.rda")
save(svmFuser.gafs.time, file ="svmFuser.gafs.time.rda")

Session Info:

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8       
 [4] LC_COLLATE=en_CA.UTF-8     LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
 [10] LC_TELEPHONE=C            LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] kernlab_0.9-22  doMC_1.3.3      iterators_1.0.7 foreach_1.4.2   caret_6.0-52    ggplot2_1.0.1   lattice_0.20-33

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.0         magrittr_1.5        splines_3.2.2        MASS_7.3-43         munsell_0.4.2      
 [6] colorspace_1.2-6    foreach_1.4.2       minqa_1.2.4         car_2.0-26          stringr_1.0.0      
 [11] plyr_1.8.3          tools_3.2.2         parallel_3.2.2      pbkrtest_0.4-2      nnet_7.3-10        
 [16] grid_3.2.2          gtable_0.1.2        nlme_3.1-122        mgcv_1.8-7          quantreg_5.18      
 [21] MatrixModels_0.4-1  iterators_1.0.7     gtools_3.5.0        lme4_1.1-9          digest_0.6.8       
 [26] Matrix_1.2-2        nloptr_1.0.4        reshape2_1.4.1      codetools_0.2-11    stringi_0.5-5      
 [31] compiler_3.2.2      BradleyTerry2_1.0-6 scales_0.3.0        stats4_3.2.2        SparseM_1.7        
 [36] brglm_0.5-9         proto_0.3-10       
> 

Answer 1:


I am not very familiar with the gafsControl() function you mention, but I ran into a very similar issue when setting parallel seeds with trainControl(). The documentation describes creating a list (length = number of resamples + 1) in which each item is a list (length = number of parameter combinations to test). I found that this does not work (see topepo/caret issue #248 for details). However, if you then turn each item into a vector, e.g.

seeds <- lapply(seeds, as.vector)

then the seeds seem to work (i.e. models and predictions are entirely reproducible). I should clarify that this is using doMC as the backend. It may be different for other parallel backends.
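
As a rough illustration of the layout described above, the following base-R sketch builds a seeds list in which every entry is a plain integer vector and then checks its shape. The helper seeds_look_valid() is hypothetical (it is not part of caret), and the counts of 5 resamples and 36 parameter combinations are simply taken from the question; treat it as a sanity check under those assumptions, not as caret's own validation.

```r
# Assumed settings, mirroring the question: 5 resamples, 36 tuning combinations
n_resamples <- 5
n_combos <- 36

# One integer vector per resample, plus a single seed for the final model
set.seed(1056)
tr_seeds <- vector(mode = "list", length = n_resamples + 1)
for (i in seq_len(n_resamples)) {
  tr_seeds[[i]] <- sample.int(1000, n_combos)
}
tr_seeds[[n_resamples + 1]] <- sample.int(1000, 1)

# Hypothetical helper (not a caret function): checks that the seeds object is
# a list of atomic numeric vectors with the expected lengths
seeds_look_valid <- function(seeds, n_resamples, n_combos) {
  is.list(seeds) &&
    length(seeds) == n_resamples + 1 &&
    all(vapply(seeds, function(s) is.atomic(s) && is.numeric(s), logical(1))) &&
    all(lengths(seeds)[seq_len(n_resamples)] >= n_combos) &&
    length(seeds[[n_resamples + 1]]) == 1
}

seeds_look_valid(tr_seeds, n_resamples, n_combos)
```

A list whose entries are themselves lists (rather than integer vectors) fails this check, which is the situation the conversion above is meant to avoid.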

Hope this helps




Answer 2:


I was able to figure out my mistake by inspecting gafs.default. The seeds argument of gafsControl() takes a vector of length (n_repeats * n_resamples) + 1, not a list (as the seeds argument of trainControl() does). This is actually stated in the documentation: ?gafsControl says that seeds is a vector of integers that can be used to set the seed during each search, and that the number of seeds must be equal to the number of resamples plus one. I figured it out the hard way; this is a reminder to read the documentation carefully :D. The relevant check in gafs.default is:

    if (!is.null(gafsControl$seeds)) {
        if (length(gafsControl$seeds) < length(gafsControl$index) + 
            1) 
            stop(paste("There must be at least", length(gafsControl$index) + 
            1, "random number seeds passed to gafsControl"))
    }
    else {
        gafsControl$seeds <- sample.int(1e+05, length(gafsControl$index) + 
        1)
    }

So, the proper way to set my ga_seeds is:

#Index for gafsControl
set.seed(1045481)
ga_index <- createFolds(train.set$Class, k=3)

#Seed for the gafsControl()
set.seed(1056)
ga_seeds <- sample.int(1500, 4)
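
To make the difference between the two attempts concrete, here is a small base-R sketch that compares the list from the question with the plain integer vector. The helper check_gafs_seeds() is hypothetical and is not caret code: it combines the length test seen in gafs.default with an extra type test that I am assuming is useful, because the length test on its own would accept a list of length 4.

```r
# Hypothetical helper (not part of caret): the seeds must be an atomic numeric
# vector with at least one seed per resample plus one
check_gafs_seeds <- function(seeds, n_resamples) {
  is.atomic(seeds) && is.numeric(seeds) && length(seeds) >= n_resamples + 1
}

n_resamples <- 3  # 3-fold CV, as in the question

# The original attempt: a list with 1000 seeds per resample
set.seed(1056)
list_seeds <- vector(mode = "list", length = n_resamples + 1)
for (i in seq_len(n_resamples)) list_seeds[[i]] <- sample.int(1500, 1000)
list_seeds[[n_resamples + 1]] <- sample.int(1000, 1)

# The corrected version: one integer per resample, plus one
set.seed(1056)
vector_seeds <- sample.int(1500, n_resamples + 1)

check_gafs_seeds(list_seeds, n_resamples)    # FALSE: it is a list
check_gafs_seeds(vector_seeds, n_resamples)  # TRUE: integer vector of length 4
```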



Answer 3:


If the seeds are set that way, can you ensure that the same feature subset is selected on each run? I am asking because of the randomness of the GA.



Source: https://stackoverflow.com/questions/32494744/caret-setting-the-seeds-inside-the-gafscontrol
