xgboost in R: how does xgb.cv pass the optimal parameters into xgb.train

梦毁少年i 2020-12-22 20:20

I've been exploring the xgboost package in R and went through several demos as well as tutorials, but this still confuses me: after using xgb.cv to …

3 Answers
  • 一向 2020-12-22 20:50

    This is a good question, and a great reply from silo with lots of detail! I found it very helpful as someone new to xgboost. Thank you. The approach of randomizing the parameters and comparing each result against the running best is very inspiring. Good to use and good to know. Now in 2018 some slight revisions are needed; for example, early.stop.round should now be early_stopping_rounds. The output mdcv is also organized slightly differently:

      min_rmse_index  <-  mdcv$best_iteration
      min_rmse <-  mdcv$evaluation_log[min_rmse_index]$test_rmse_mean
    

    And depending on the application (linear, logistic, etc.), the objective, eval_metric, and other parameters should be adjusted accordingly.
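
    For example, here is a minimal sketch of what a binary-classification setup might look like. This is only an illustrative assumption, not part of the original answer: it presumes a dtrain whose label is 0/1, and the parameter values are not tuned recommendations.

      # Hypothetical classification setup: objective and eval_metric swapped in;
      # maximize = TRUE because a larger AUC is better
      param <- list(objective = "binary:logistic",
                    eval_metric = "auc",
                    max_depth = sample(6:10, 1),
                    eta = runif(1, .01, .3))
      mdcv <- xgb.cv(data = dtrain, params = param,
                     nfold = 5, nrounds = 1000,
                     verbose = FALSE, early_stopping_rounds = 8, maximize = TRUE)
      # Column names in evaluation_log follow the metric, e.g. test_auc_mean
      best_auc <- mdcv$evaluation_log[mdcv$best_iteration]$test_auc_mean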

    For the convenience of anyone who is running a regression, here is the slightly adjusted version of code (most are the same as above).

    library(xgboost)
    # Matrix for xgb: dtrain and dtest, "label" is the dependent variable
    dtrain <- xgb.DMatrix(X_train, label = Y_train)
    dtest <- xgb.DMatrix(X_test, label = Y_test)
    
    best_param <- list()
    best_seednumber <- 1234
    best_rmse <- Inf
    best_rmse_index <- 0
    
    set.seed(123)
    for (iter in 1:100) {
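      # Note: newer xgboost releases rename "reg:linear" to "reg:squarederror"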
      param <- list(objective = "reg:linear",
                    eval_metric = "rmse",
                    max_depth = sample(6:10, 1),
                    eta = runif(1, .01, .3), # Learning rate, default: 0.3
                    subsample = runif(1, .6, .9),
                    colsample_bytree = runif(1, .5, .8), 
                    min_child_weight = sample(1:40, 1),
                    max_delta_step = sample(1:10, 1)
      )
      cv.nround <-  1000
      cv.nfold <-  5 # 5-fold cross-validation
      seed.number  <-  sample.int(10000, 1) # set seed for the cv
      set.seed(seed.number)
      mdcv <- xgb.cv(data = dtrain, params = param,  
                     nfold = cv.nfold, nrounds = cv.nround,
                     verbose = F, early_stopping_rounds = 8, maximize = FALSE)
    
      min_rmse_index  <-  mdcv$best_iteration
      min_rmse <-  mdcv$evaluation_log[min_rmse_index]$test_rmse_mean
    
      if (min_rmse < best_rmse) {
        best_rmse <- min_rmse
        best_rmse_index <- min_rmse_index
        best_seednumber <- seed.number
        best_param <- param
      }
    }
    
    # The best index (min_rmse_index) is the best "nround" for the final model
    nround <- best_rmse_index
    set.seed(best_seednumber)
    # Train the final model on the training data (not the test set)
    xg_mod <- xgboost(data = dtrain, params = best_param, nrounds = nround, verbose = FALSE)
    
    # Check error in testing data
    yhat_xg <- predict(xg_mod, dtest)
    (MSE_xgb <- mean((yhat_xg - Y_test)^2))
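
    To tie this back to the original question: once best_param, nround, and best_seednumber are known, they can just as well be passed to xgb.train on the same dtrain DMatrix. A minimal sketch follows; the watchlist here is only an optional assumption for monitoring and is not part of the code above.

      set.seed(best_seednumber)
      xg_mod2 <- xgb.train(params = best_param, data = dtrain, nrounds = nround,
                           watchlist = list(train = dtrain, test = dtest),
                           verbose = 0)
      yhat_xg2 <- predict(xg_mod2, dtest)  # predict() works the same as with xgboost()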
    
