How to write a function in R that will implement the “best subsets” approach to model selection?

问题

So I need to write a function that takes a data-frame as input. The columns are my explanatory variables (except for the last column/right most column which is the response variable). I'm trying to fit a linear model and track each model's adjusted r-square as the criterion used to pick the best model.

The model will use all the columns as the explanatory variables (except for the right-most column which will be the response variable).

The function is supposed to create a tibble with a single column for the model number (I have no idea what this is supposed to mean), subset of of explanatory variables along with response variable, model formula, outcome of fitting linear model, and others as needed.

The function is supposed to output: the model number, the explanatory variables in the model, the value of adjusted r-square, and a graph (I can figure the graph out on my own). I have a image of a table here to help with visualizing what the result should look like.

Table

I figured out that this code will get me the explanatory and response variables:

  cols <- colnames(data)
  # Get the response variable.
  y <- tail(cols, 1)
  # Get a list of the explanatory variables.
  xs <- head(cols, length(cols) - 1)

I know that I can get a model with something like this (ignore variable names for now):

model <- final_data %>%
  group_by(debt) %>%
  lm(debt ~ distance, data = .) %>%
  glance()

I also know that I'm going to have to somehow map that model to each of the rows in the tibble that I'm trying to create.

What I'm stuck on is figuring out how to put all this together and create the complete function. I wish I could provide more details but I am completely stuck. I've spent about 10 hours working on this today... I asked my professor for help and he just told me to post here.

For reference here is a very early (not working at all) attempt I made:

best_subsets <- function(data) {
  cols <- colnames(data)
  # Get the response variable.
  y <- tail(cols, 1)
  # Get a list of the explanatory variables.
  xs <- head(cols, length(cols) - 1)

  # Create the formula as a string and then later in the lm function
  # have it turned into a real formula.
  form <- paste(y, "~", xs, sep = " ")
  data %>%
    lm(as.formula(form), data = .) %>%
    glance()
}

回答1:

I don't fully understand your description but I think I understand your goal. Maybe this can help in some way?:

library(tidyverse)
library(broom)
library(data.table)

lm_func <- function(df){
  fit1 <- lm(df[, 1] ~ df[, 2], data = df)
  fit2 <- lm(df[, 1] ~ df[, 3], data = df)
  fit3 <- lm(df[, 1] ~ df[, 2], df[, 3], data = df)
  results <- list(fit1, fit2, fit3)
  names(results) <- paste0("explanitory_variables_", 1:3)
  r_sq <- lapply(results, function(x){
    glance(x)
  })
  r_sq_df <- rbindlist(r_sq, idcol = "df_name")
  r_sq_df

}
lm_func(iris)

This gives you a dataframe of all the important outputs from which you can select adj.r.squared. Would also be possible to automate. As a side note, selecting a model based on R squared seems very strange, dangers of overfitting? a higher R squared does not necessarily mean a better model, consider looking into AIC as well?

Let me know if this helps at all or if I can refine the answer a little more towards your goal.

UPDATE:

lm_func <- function(df) {
  lst <- c()
  for (i in 2:ncol(df)) {
    ind <- i
    form_df <- df[, 1:ind]
    form <- DF2formula(form_df)
    fit <- lm(form, data = df)
    lst[[i - 1]] <- glance(fit)
  }
  lst
  names(lst) <- paste0("explanitory_variables_", 1:length(lst))
  lst <- rbindlist(lst, idcol = "df_name")
  lst
}
lm_func(iris)

This assumes your first column is y and you want a model for every additional column.

OK one more UPDATE: I think this does everything possible but is probably overkill:

library(combinat)
library(data.table)
library(tidyverse)
library(broom)

#First function takes a dataframe containing only the dependent and independent variables. Specify them by variable name or column position.
#The function then returns a list of dataframes of every possible order of independent variables (y ~ x1 + x2...) (y ~ x2 + x1...).
#So you can run your model on every possible sequence of explanatory variables
formula_func <- function(df, dependent = df["Sepal.Length"], independents = df[c("Sepal.Width", "Petal.Length", "Petal.Width", "Species")]) {
  independents_df_list <- permn(independents) #length of output should be the factorial of the number of independent variables
  df_list <- lapply(independents_df_list, function(x){ #this just pastes your independent variable as the first column of each df
    cbind(dependent, x)
  })
  df_list
}
permd_df_list <- formula_func(iris) # voila

# This function takes the output from the previous function and runs the lm building in one variable each time (y ~ x1), (y ~ x1 + x2) and so on
# So the result is many lms building in one one independent variable at a time in every possible order
# If that is as confusing to you as it is to me then check final output. You will see what model formula is used per row and in what order each explanatory variable was added
lm_func <- function(form_df_list, df) {
 mega_lst <- c()
 mega_lst <-  lapply(form_df_list, function(x) {
   lst <- vector(mode = "list", length = length(2:ncol(x)))
   for (i in 2:ncol(x)) {
      ind <- i
      form_df <- x[, 1:ind]
      form <- DF2formula(form_df)
      fit <- lm(form, data = x)
      lst[[i - 1]] <- glance(fit)
      names(lst)[[i-1]] <- deparse(form)
    }
   lst <- rbindlist(lst, idcol = "Model_formula")
   return(lst)
   })
 return(mega_lst)
}
everything_list <- lm_func(permd_df_list, iris) # VOILA!!!
#Remove duplicates and return single df
everything_list_distinct <- everything_list %>% 
  rbindlist() %>% 
  distinct()


## You can now subset and select whichever column you want from the final output

I posted this as a coding exercise so let me know if anyone spots any errors. Just one caveat, this code does NOT represent a statistically sound approach just a coding experiment so be sure to understand the stats first!

来源：https://stackoverflow.com/questions/58945792/how-to-write-a-function-in-r-that-will-implement-the-best-subsets-approach-to

标签

dataframe

tibble