h2o.glm lambda search not appearing to iterate over all lambdas

backend · open · 2 answers · 485 views
离开以前 2021-01-14 02:35

Please consider the following basic reproducible example:

library(h2o)
h2o.init()
data("iris")
iris.hex = as.h2o(iris, "iris.hex")
mod = h2o.glm(y = "Sepal.Length", x = setdiff(colnames(iris), "Sepal.Length"),
              training_frame = iris.hex, nfolds = 2, seed = 100,
              lambda_search = TRUE, family = "gamma")


        
2 Answers
  •  春和景丽
    2021-01-14 03:10

    What is happening is that it learns from the cross-validation models in order to optimize the parameters used for the final run. (BTW, you are using nfolds = 2, which is fairly unusual for a small data set: each fold model learns on just 75 records, then tests on the other 75, so there is going to be a lot of noise in what it learns from CV.)
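    To make that 75/75 split concrete, here is a plain-Python sketch of 2-fold CV on a 150-row data set (an illustration only, not H2O's internal fold assignment):

```python
import random

# Sketch: 2-fold CV on iris-sized data (150 rows). Each fold model
# trains on 75 rows and is scored on the other 75, so per-fold
# estimates are noisy on a data set this small.
rows = list(range(150))
random.seed(100)
random.shuffle(rows)
fold1, fold2 = rows[:75], rows[75:]
print(len(fold1), len(fold2))  # 75 75
```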

    Following on from your code:

    tail(mod@allparameters$lambda)
    mod@model$lambda_best
    

    I'm using H2O 3.14.0.1, so here is what I get:

    [1] 0.002129615 0.001940426 0.001768044 0.001610975 0.001467861 0.001337460
    

    and:

    [1] 0.001610975
    

    Then if we go look at the same for the 2 CV models:

    lapply(mod@model$cross_validation_models, function(m_cv){
      m <- h2o.getModel(m_cv$name)
      list( tail(m@allparameters$lambda), m@model$lambda_best )
      })
    

    I get:

    [[1]]
    [[1]][[1]]
    [1] 0.0002283516 0.0002080655 0.0001895815 0.0001727396 0.0001573939 0.0001434115
    
    [[1]][[2]]
    [1] 0.002337249
    
    
    [[2]]
    [[2]][[1]]
    [1] 0.0002283516 0.0002080655 0.0001895815 0.0001727396 0.0001573939 0.0001434115
    
    [[2]][[2]]
    [1] 0.00133746
    

    I.e. it seems the lowest best lambda found among the CV models was 0.00133746, so that was used as the early-stopping point for the final model's lambda search.

    BTW, if you poke around in those cv models you will see they both tried 100 values for lambda. It is only the final model that does the extra optimization.
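    Those 100 values follow the usual lambda-search recipe: a geometric sequence descending from a data-derived lambda_max down to lambda_max * lambda_min_ratio. A sketch in Python (the 1e-4 ratio and nlambdas = 100 here are glmnet-style assumptions, not necessarily H2O's exact defaults):

```python
import numpy as np

# Sketch of how a 100-value lambda path is typically built
# (glmnet-style; H2O's exact recipe may differ): geometrically
# spaced from lambda_max down to lambda_max * lambda_min_ratio.
def lambda_path(lambda_max, lambda_min_ratio=1e-4, nlambdas=100):
    return np.geomspace(lambda_max, lambda_max * lambda_min_ratio, nlambdas)

path = lambda_path(1.0)
print(len(path))  # 100 values, from 1.0 down to 0.0001
```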

    (I'm thinking of it as a time optimization, but reading p.26/27 of the Generalized Linear Models booklet (free download from https://www.h2o.ai/resources/), I think it is mainly about using the cv data to avoid over-fitting.)

    You can explicitly specify a set of lambda values to try. BUT, the cross-validation learning will still take priority for the final model. E.g. in the following the final model only tried the first 4 of the 6 lambda values I suggested, because both CV models liked 0.001 best.

    mx = h2o.glm(y = "Sepal.Length", x = setdiff(colnames(iris), "Sepal.Length"), 
                training_frame = iris.hex, nfolds = 2, seed = 100,
                lambda = c(1.0, 0.1, 0.01, 0.001, 0.0001, 0), lambda_search = T,
                family = "gamma")
    
    tail(mx@allparameters$lambda)
    mx@model$lambda_best
    
    lapply(mx@model$cross_validation_models, function(m_cv){
      m <- h2o.getModel(m_cv$name)
      list( tail(m@allparameters$lambda), m@model$lambda_best )
    })
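    One way to picture that early stopping (a simplified Python sketch of the observed behaviour, not H2O's actual code): the final model walks its descending lambda path, but stops once it reaches the smallest best lambda reported by the CV models.

```python
# Simplified sketch (not H2O's implementation): truncate a descending
# lambda path at the smallest "best" lambda found by the CV models.
def truncated_lambda_path(full_path, cv_best_lambdas):
    stop_at = min(cv_best_lambdas)  # e.g. both CV models liked 0.001 best
    path = []
    for lam in full_path:
        path.append(lam)
        if lam <= stop_at:          # early stop: no need to go lower
            break
    return path

# The 6 suggested lambdas from the example above; only 4 get tried.
suggested = [1.0, 0.1, 0.01, 0.001, 0.0001, 0]
print(truncated_lambda_path(suggested, [0.001, 0.001]))
# -> [1.0, 0.1, 0.01, 0.001]
```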
    
