Running multiple linear regressions across several columns of a data frame in R

早过忘川 提交于 2019-12-11 08:07:45

问题


I have a dataset structured as such: enter image description here

I would like to run linear regression models and ANOVA using V1, V2...etc. as the independent variables and the g column as the dependent variable in each case (i.e. lm(V1 ~ g), lm(V2 ~ g), and so forth). This would be straightforward except that these linear regressions need to be grouped by level in the pair column, such that, for example, my output contains lm(V1 ~ g) for all rows with pair 1.1 and lm(V1 ~ g) for all pairs 1.201, etc.

I've tried a number of approaches using for loops, lapply and the data.table package, and nothing gives me exactly the output I'd like. Can anyone give me any insight on the best way to tackle this problem?

Edit: My full data set has 7056 different pairs in the pair column and 100 V columns (V1...V100). My latest attempt at this problem:

df$pair <- as.factor(df$pair)
out <- list()
for (i in 3:ncol(df)){
    out[[i]] <- lapply(levels(df$pair), function(x) {
    data.frame(df=x, g = coef(summary(lm(df[,i]~ df$g, data=df[df$pair==x,])),row.names=NULL))})
    }

回答1:


Let's get some tidyverse power here, along with broom, and forgo all these loops...

First I'll make a dummy table:

df <- data.frame(
  g = runif(50), 
  pair = sample(x = c("A", "B", "C"), size = 50, replace = TRUE), 
  V1 = runif(50), 
  V2 = runif(50), 
  V3 = runif(50), 
  V4 = runif(50), 
  V5 = runif(50),
  stringsAsFactors = FALSE
)

That's approximately what your data structure looks like. Now onto the meat of the code:

library(tidyverse)
library(broom)

df %>% 
  as_tibble %>% 
  gather(key = "column", value = "value", V1:V5) %>%       # first set the data in long format
  nest(g, value) %>%                                       # now nest the dependent and independent factors
  mutate(model = map(data, ~lm(g ~ value, data = .))) %>%  # fit the model using purrr
  mutate(tidy_model = map(model, tidy)) %>%                # clean the model output with broom
  select(-data, -model) %>%                                # remove the "untidy" parts
  unnest()                                                 # get it back in a recognizable data frame

Which gives us the following:

# A tibble: 30 x 7
   pair  column term        estimate std.error statistic  p.value
   <chr> <chr>  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
 1 C     V1     (Intercept)  0.470       0.142    3.31   0.00561 
 2 C     V1     value        0.125       0.265    0.472  0.645   
 3 B     V1     (Intercept)  0.489       0.142    3.45   0.00359 
 4 B     V1     value       -0.0438      0.289   -0.151  0.882   
 5 A     V1     (Intercept)  0.515       0.111    4.63   0.000279
 6 A     V1     value       -0.00569     0.249   -0.0229 0.982   
 7 C     V2     (Intercept)  0.367       0.147    2.50   0.0265  
 8 C     V2     value        0.377       0.300    1.26   0.231   
 9 B     V2     (Intercept)  0.462       0.179    2.59   0.0206  
10 B     V2     value        0.0175      0.322    0.0545 0.957   
# … with 20 more rows

yep, that's a beaut! Note that I used lm(g ~ value) instead of lm(value ~ g) as this is what your text description alluded to.




回答2:


Using the tidyverse package to filter your dataframe:

library(tidyverse)

lm(V1~g, data=filter(yourData, pair==1.1))
lm(V2~g, data=filter(yourData, pair==1.201))

This ensures that you are eliminating rows that don't contain your desired pair value for each regression model. You could probably create a loop to do this, but I think it's easier to just manually keep filtering through the pair values. If you really want to use a loop, here's a fairly simple way to do it:

for (i in levels(yourData$pair)) {
  if (i==1.1) {
    mod1 <- lm(V1~g, data=filter(yourData, pair==i))
  }

  if (i==1.201) {
    mod2 <- lm(V2~g, data=filter(yourData, pair==i))
  }
}

But this is still manually looping through the levels of pair. I would have to see your whole data set to automate the looping process.

Also, if the g column contains your dependent values, the call should be lm(g~V1), lm(g~V2), etc. It shouldn't be lm(V1~g).



来源:https://stackoverflow.com/questions/54907726/running-multiple-linear-regressions-across-several-columns-of-a-data-frame-in-r

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!