How to use expand.grid values to run various model hyperparameter combinations for ranger in R

我与影子孤独终老i submitted on 2021-01-29 08:32:02

Question


I've seen various posts on how to select the independent variables for a model by using expand.grid and then create a formula based on that selection. However, I prepare my input tables beforehand and store them in a list.

library(ranger)
data(iris)
Input_list <- list(iris1 = iris, iris2 = iris)  # let's assume these are different input tables

I'm rather interested in trying all the possible hyperparameter combinations for a given algorithm (here: Random Forest using ranger) for my list of input tables. I do the following to set up the grid:

hyper_grid <- expand.grid(
  Input_table = names(Input_list),
  Trees = c(10, 20),
  Importance = c("none", "impurity"),
  Classification = TRUE,
  Repeats = 1:5,
  Target = "Species")

> head(hyper_grid)
  Input_table Trees Importance Classification Repeats  Target
1       iris1    10       none           TRUE       1 Species
2       iris2    10       none           TRUE       1 Species
3       iris1    20       none           TRUE       1 Species
4       iris2    20       none           TRUE       1 Species
5       iris1    10   impurity           TRUE       1 Species
6       iris2    10   impurity           TRUE       1 Species

My question is: what is the best way to pass these values to the model? Currently I'm using a for loop:

for (i in seq_len(nrow(hyper_grid))) {
  RF_train <- ranger(
    dependent.variable.name = hyper_grid[i, "Target"], 
    data = Input_list[[hyper_grid[i, "Input_table"]]],  # referring to the named object in the list
    num.trees = hyper_grid[i, "Trees"], 
    importance = hyper_grid[i, "Importance"], 
    classification = hyper_grid[i, "Classification"])  # otherwise regression is performed
  print(RF_train)
}

This iterates over each row of the grid. But for one, I now have to tell the model explicitly whether it is classification or regression. I assume the factor Species is converted to numeric factor levels, so regression occurs by default. Is there a way to prevent this, and could I also use e.g. apply for this task? This way of iterating also results in messy function calls:

Call:
 ranger(dependent.variable.name = hyper_grid[i, "Target"], data = Input_list[[hyper_grid[i,      "Input_table"]]], num.trees = hyper_grid[i, "Trees"], importance = hyper_grid[i,      "Importance"], classification = hyper_grid[i, "Classification"])

Second: in reality, the output of the model is of course not printed. Instead, I immediately capture the important results (mainly RF_train$confusion.matrix) and write them into an extended version of hyper_grid, on the same row as the input parameters. Is this too costly performance-wise? I ask because if I store the ranger objects themselves, I run into memory issues at some point.

Thank you!


Answer 1:


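I think it is cleanest to wrap the training and the extraction of the values you need into a function. The dots (...) are needed for use with purrr::pmap below: pmap passes every column of the grid as a named argument, and the dots absorb the columns the function does not use (here Repeats).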

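fit_and_extract_metrics <- function(Target, Input_table, Trees, Importance, Classification, ...) {
  # expand.grid() stores strings as factors by default; convert them to character
  # so that lookups go by name rather than by integer factor code
  RF_train <- ranger(
    dependent.variable.name = as.character(Target),
    data = Input_list[[as.character(Input_table)]],  # referring to the named object in the list
    num.trees = Trees,
    importance = as.character(Importance),
    classification = Classification)  # otherwise regression is performed

  # Return only the metrics of interest; the fitted model itself is discarded
  data.frame(Prediction_error = RF_train$prediction.error,
             True_positive = RF_train$confusion.matrix[1])
}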
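
A side note on the grid columns, since the question mentions the response being treated as numeric: expand.grid() converts character vectors to factors unless stringsAsFactors = FALSE is set, and indexing with a factor uses its integer codes rather than its labels. Whether this is what triggers the regression behaviour inside ranger is an assumption I have not verified, but the base R effect can be seen in a small sketch (toy is a made-up list, not data from the question):

```r
# Default: the string column becomes a factor
grid_default <- expand.grid(Target = "Species")
# With stringsAsFactors = FALSE it stays a character vector
grid_char <- expand.grid(Target = "Species", stringsAsFactors = FALSE)

is.factor(grid_default$Target)   # TRUE
is.character(grid_char$Target)   # TRUE

# Hypothetical list standing in for the columns of a data frame
toy <- list(Sepal.Length = "first element", Species = "second element")

# The factor has a single level, so its integer code is 1:
# indexing with it picks the FIRST element, not the one named "Species"
by_factor <- toy[[grid_default$Target]]
by_name <- toy[[grid_char$Target]]

by_factor  # "first element"
by_name    # "second element"
```

Setting stringsAsFactors = FALSE in expand.grid(), or converting with as.character() before use, avoids the ambiguity.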

Then you can add the results as a column by mapping over the rows using for example purrr::pmap:

hyper_grid$res <- purrr::pmap(hyper_grid, fit_and_extract_metrics)
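
To see what pmap does with the grid without fitting any models, here is a minimal sketch with a stand-in function. describe_row and toy_grid are hypothetical names used only for illustration; the point is that each row's columns arrive as named arguments and that unused columns land in the dots:

```r
library(purrr)

# Hypothetical stand-in for a model-fitting function: it uses only two of the
# grid columns; the unused Repeats column is absorbed by `...`
describe_row <- function(Trees, Importance, ...) {
  paste(Trees, "trees, importance =", Importance)
}

toy_grid <- expand.grid(
  Trees = c(10, 20),
  Importance = c("none", "impurity"),
  Repeats = 1:2,
  stringsAsFactors = FALSE
)

# pmap_chr() calls describe_row() once per row and returns a character vector
labels <- pmap_chr(toy_grid, describe_row)

length(labels) == nrow(toy_grid)  # one result per grid row
labels[[1]]                       # "10 trees, importance = none"
```

Without the dots in the function signature, the call would fail with an "unused argument" error for Repeats.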

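By mapping in this way, the function is applied row by row. Because it returns only a small data frame of metrics, each fitted ranger object goes out of scope as soon as its row has been processed and can be garbage-collected, so you should not run into memory issues.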

The result of purrr::pmap is a list, so the column res is a list column holding one single-row data frame per grid row. tidyr::unnest expands that list column so that its elements become regular columns of your data frame.

tidyr::unnest(hyper_grid, res)
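
As a small illustration of the unnesting step, here is a sketch with a hand-built list column. The numbers are made up and toy_grid is a hypothetical stand-in for hyper_grid:

```r
library(tidyr)

toy_grid <- data.frame(Trees = c(10, 20))

# List column holding one single-row data frame per grid row (made-up values)
toy_grid$res <- list(
  data.frame(Prediction_error = 0.05, True_positive = 50),
  data.frame(Prediction_error = 0.04, True_positive = 50)
)

# unnest() returns a new object, so assign the result to keep it
flat <- unnest(toy_grid, res)

names(flat)  # "Trees" "Prediction_error" "True_positive"
nrow(flat)   # 2
```

Note that unnest does not modify its input in place, so in practice you would assign the result, e.g. back to hyper_grid.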

I think this approach is very elegant, but it requires some tidyverse knowledge. I highly recommend this book if you want to know more about that. Chapter 25 (Many models) describes an approach similar to the one I'm taking here.
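
If you would rather stay in base R (the question asks about apply), the same row-wise pattern can be sketched with Map and rbind. This is an alternative to the purrr approach, not part of it, and toy_fit is a hypothetical stand-in for fit_and_extract_metrics so that the example runs without ranger:

```r
# Hypothetical stand-in for fit_and_extract_metrics(): returns a one-row
# data frame; unused grid columns would be absorbed by `...`
toy_fit <- function(Trees, Repeats, ...) {
  data.frame(Score = Trees * Repeats)
}

toy_grid <- expand.grid(Trees = c(10, 20), Repeats = 1:2)

# Call the function once per row, passing the grid columns as named arguments
res_list <- do.call(Map, c(list(f = toy_fit), toy_grid))

# Bind the one-row results together and attach them to the grid
toy_results <- cbind(toy_grid, do.call(rbind, res_list))

toy_results$Score  # 10 20 20 40
```

As with pmap, only the small per-row data frames are kept, so the memory argument is the same.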



Source: https://stackoverflow.com/questions/60945003/how-to-use-expand-grid-values-to-run-various-model-hyperparameter-combinations-f
