looping with iterations over two lists of variables for a multiple regression in R

Posted by 一笑奈何 on 2020-01-04 02:20:16

Question


I want to write a loop in R that runs multiple regressions with one dependent variable and two lists of independent variables (all continuous). The model is additive, and the loop should iterate through the two lists in parallel: the first column from the first list plus the first column from the second list, then the second column from each list, and so on. The problem is that I can't get it to iterate through the lists properly; my loop runs more models than it should.

The data frame I describe here is just a subset; I will actually have to run this 3772 times (I am working on RNA-seq transcript expression).

My data frame is called dry and contains 22 variables (columns) and 87 observations (rows). Column 1 contains the genotype IDs, columns 2:11 contain one set of independent variables to iterate through, columns 12:21 contain a second set, and column 22 contains my dependent variable, FITNESS_DRY. This is what the structure looks like:

str(dry)
'data.frame':   87 obs. of  22 variables:
$ geneID     : Factor w/ 87 levels "e10","e101","e102",..: 12 15 17 24 25 30 35 36 38 39 ...
$ RDPI_T1    : num  1.671 -0.983 -0.776 -0.345 0.313 ...
$ RDPI_T2    : num  -0.976 -0.774 -0.532 -1.137 1.602 ...
$ RDPI_T3    : num  -0.197 -0.324 0.805 -0.701 -0.566 ...
$ RDPI_T4    : num  0.289 -0.92 1.117 -1.214 -0.447 ...
$ RDPI_T5    : num  -0.671 1.963 NA -1.024 -0.295 ...
$ RDPI_T6    : num  2.606 -1.116 -0.383 -0.893 0.119 ...
$ RDPI_T7    : num  -0.843 -0.229 -0.297 0.504 -0.712 ...
$ RDPI_T8    : num  -0.227 NA NA -0.816 -0.761 ...
$ RDPI_T9    : num  0.754 -1.304 1.867 -0.514 -1.377 ...
$ RDPI_T10   : num  1.1352 -0.1028 -0.69 2.0242 -0.0925 ...
$ DRY_T1     : num  0.6636 -0.64508 -0.24643 -1.43231 -0.00855 ...
$ DRY_T2     : num  1.008 0.823 -0.658 -0.148 0.272 ...
$ DRY_T3     : num  -0.518 -0.357 1.294 0.408 0.771 ...
$ DRY_T4     : num  0.0723 0.2834 0.5198 1.6527 0.4259 ...
$ DRY_T5     : num  0.1831 1.9984 NA 0.0923 0.1232 ...
$ DRY_T6     : num  -1.55 0.366 0.692 0.902 -0.993 ...
$ DRY_T7     : num  -2.483 -0.334 -1.077 -1.537 0.393 ...
$ DRY_T8     : num  0.396 NA NA -0.146 -0.468 ...
$ DRY_T9     : num  -0.694 0.353 2.384 0.665 0.937 ...
$ DRY_T10    : num  -1.24 -1.57 -1.36 -3.88 -1.4 ...
$ FITNESS_DRY: num  1.301 3.365 0.458 0.346 1.983 ...

The goal is to run 10 multiple regressions looking like this:

lm1<-lm(FITNESS_DRY~DRY_T1+RDPI_T1)
lm2<-lm(FITNESS_DRY~DRY_T2+RDPI_T2)

and so forth, iterating through all ten columns of both lists. In terms of column indexing, this is equivalent to:

lm1<-lm(FITNESS_DRY~dry[,12]+dry[,2])
lm2<-lm(FITNESS_DRY~dry[,13]+dry[,3])

etc.

My loop should then calculate summaries for each model, and combine all the pvalues (4th column of the lm summary) in an output object.

I first defined my variable lists

var_list<-list(
var1=dry[,12:21],
var2=dry[,2:11]
)

This is the loop I tried which doesn't work properly:

lm.test1<-name<-vector()
for (i in 12:length(var_list$var1)){
    for (j in 2:length(var_list$var2)){
lm.tmp<-lm(FITNESS_DRY~dry[,i]+dry[,j], na.action=na.omit, data=dry)
sum.tmp<-summary(lm.tmp)
lm.test1<-rbind(lm.test1,sum.tmp$coefficients[,4]) }
}

The loop returns this warning message:

Warning message:
In rbind(lm.test6, sum.tmp$coefficients[, 4]) :
number of columns of result is not a multiple of vector length (arg 2)

I can call up the object "lm.test1", but it has 27 rows instead of the 10 I want, so the iterations are not working properly. Can anyone help with this, please? Also, it would be great if I could add the column names from each list of variables to the summary. I have tried using this for each variable list, but without success:

name<-append(name, as.character(colnames(var_list$var1)))

Any ideas? Thanks in advance for any help!

UPDATE 1: More information on the full data set: my actual data still has a first column "geneID", followed by 3772 columns named DRY_T1...DRY_T3772, then another 3772 columns named RDPI_T1...RDPI_T3772, and finally my dependent variable "FITNESS_DRY". I still want to run all additive models as such:

lm1<-lm(FITNESS_DRY~DRY_T1+RDPI_T1)
lm2<-lm(FITNESS_DRY~DRY_T2+RDPI_T2)
lm3772<-lm(FITNESS_DRY~DRY_T3772+RDPI_T3772)

I simulated a dataset as such:

set.seed(2)
dat3 = as.data.frame(replicate(7544, runif(20)))
names(dat3) = paste0(rep(c("DRY_T","RDPI_T"),each=3772), 1:3772)
dat3 = cbind(dat3, FITNESS_DRY=runif(20))

I then run the for loop:

models = list()
for(i in 1:3772) {
  vars = names(dat3)[grepl(paste0(i,"$"), names(dat3))]
  models[[as.character(i)]] = lm(paste("FITNESS_DRY ~ ", paste(vars, collapse=" + ")),
                                 data = dat3)
}

This works fine on the data simulation, but when I try it on my real dataset that is set up exactly in the same way it doesn't work. The loop is probably having issues handling numbers with two or more digits. I get this error message:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
0 (non-NA) cases

UPDATE 2: Indeed, the loop had issues handling numbers with two or more digits. To see how things go wrong in the original version, I ran this (my dataset is called "dry2"):

names(dry2)[grepl("2$", names(dry2))]

This returned every DRY_T and RDPI_T variable whose number ends in "2" (DRY_T2, DRY_T12, DRY_T22, ...) instead of just the one DRY_T2/RDPI_T2 pair.

To fix the issue this new code works:

models = list()

for(i in 1:3772) {
  vars = names(dry2)[names(dry2) %in% paste0(c("DRY_T", "RDPI_T"), i)]
  models[[as.character(i)]] = lm(paste("FITNESS_DRY ~ ", paste(vars, collapse=" + ")),
                                 data = dry2)
}

Answer 1:


There are a number of ways to set up the model formulas for iteration. Here's one approach, which we demonstrate using a for loop or map from the purrr package for the iteration. Then we use tidy from the broom package to get the coefficients and p-values.

library(tidyverse)
library(broom)

# Fake data
set.seed(2)
dat = as.data.frame(replicate(20, runif(20)))
names(dat) = paste0(rep(c("DRY_T","RDPI_T"),each=10), 0:9)
dat = cbind(dat, FITNESS_DRY=runif(20))

# Generate list of models

# Using for loop
models = list()

for(i in 0:9) {

  # Get the two column names to use for this iteration of the model
  vars = names(dat)[grepl(paste0(i,"$"), names(dat))]

  # Fit the model and add results to the output list
  models[[as.character(i)]] = lm(paste("FITNESS_DRY ~ ", paste(vars, collapse=" + ")),
                                 data = dat)
}

# Same idea using purrr::map to iterate
models = map(0:9 %>% set_names(), 
             ~ {
               vars = names(dat)[grepl(paste0(.x,"$"), names(dat))]
               form = paste("FITNESS_DRY ~ ", paste(vars, collapse=" + "))
               lm(form, data = dat)
             })
# Check first two models
models[1:2]
#> $`0`
#> 
#> Call:
#> lm(formula = form, data = dat)
#> 
#> Coefficients:
#> (Intercept)       DRY_T0      RDPI_T0  
#>      0.4543       0.3025      -0.1624  
#> 
#> 
#> $`1`
#> 
#> Call:
#> lm(formula = form, data = dat)
#> 
#> Coefficients:
#> (Intercept)       DRY_T1      RDPI_T1  
#>     0.64511     -0.33293      0.06698
# Get coefficients and p-values for each model in a single data frame
results = map_df(models, tidy, .id="run_number")

results
#> # A tibble: 30 x 6
#>    run_number term        estimate std.error statistic p.value
#>    <chr>      <chr>          <dbl>     <dbl>     <dbl>   <dbl>
#>  1 0          (Intercept)   0.454      0.153     2.96  0.00872
#>  2 0          DRY_T0        0.303      0.197     1.53  0.143  
#>  3 0          RDPI_T0      -0.162      0.186    -0.873 0.395  
#>  4 1          (Intercept)   0.645      0.185     3.49  0.00279
#>  5 1          DRY_T1       -0.333      0.204    -1.63  0.122  
#>  6 1          RDPI_T1       0.0670     0.236     0.284 0.780  
#>  7 2          (Intercept)   0.290      0.147     1.97  0.0650 
#>  8 2          DRY_T2        0.270      0.176     1.53  0.144  
#>  9 2          RDPI_T2       0.180      0.185     0.972 0.345  
#> 10 3          (Intercept)   0.273      0.187     1.46  0.162  
#> # … with 20 more rows

Created on 2019-06-28 by the reprex package (v0.2.1)

If you don't need to save the model objects, you can just return the data frame of coefficients and p-values:

results = map_df(0:9 %>% set_names(), 
            ~ {
              vars = names(dat)[grepl(paste0(.x,"$"), names(dat))]
              form = paste("FITNESS_DRY ~ ", paste(vars, collapse=" + "))
              tidy(lm(form, data = dat))
            }, .id="run_number")
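
For readers who prefer to stay in base R, the same table of p-values can be collected without purrr or broom. The following is only a minimal sketch, assuming column suffixes 1:10 as in the question; the simulated data and the names n_pairs and pvals are illustrative and not part of the original answer.

```r
# Simulated stand-in for the real data (assumption: suffixes run 1..n_pairs)
set.seed(2)
n_pairs <- 10
dat <- as.data.frame(replicate(2 * n_pairs, runif(20)))
names(dat) <- paste0(rep(c("DRY_T", "RDPI_T"), each = n_pairs), 1:n_pairs)
dat$FITNESS_DRY <- runif(20)

# One row of p-values per model: intercept, DRY term, RDPI term
pvals <- t(sapply(1:n_pairs, function(i) {
  vars <- paste0(c("DRY_T", "RDPI_T"), i)
  fit <- lm(reformulate(vars, response = "FITNESS_DRY"), data = dat)
  unname(coef(summary(fit))[, "Pr(>|t|)"])
}))

# Label rows by the pair they came from and columns by term type
rownames(pvals) <- paste0("T", 1:n_pairs)
colnames(pvals) <- c("Intercept", "DRY", "RDPI")
```

This assumes every model estimates all three coefficients. summary() omits coefficients that cannot be estimated, in which case the per-model vectors differ in length and sapply returns a list rather than a matrix.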

UPDATE: In answer to your comment, if you replace all instances of 0:9 with 1:10 (sorry, didn't notice that your column suffixes went from 1:10 rather than 0:9), and all instances of dat (my fake data) with dry2 (or whatever name you're using for your data frame), the code will run with your data, so long as the column names are the same as the ones you used in your question. If you're using different column names, you'll need to adapt the code, either by hard-coding the new names or by creating a function that can accept whatever column names you're using for the model(s) you're generating.

To explain what the code is doing: First, we need to get the names of the columns we want to use in each iteration of the model. For example, in the for-loop version:

vars = names(dry2)[grepl(paste0(i,"$"), names(dry2))]

When, for example, i=2, this resolves to:

vars = names(dry2)[grepl("2$", names(dry2))]
vars
[1] "RDPI_T2" "DRY_T2"

So those are the two columns we want to use to generate a regression formula. "2$" is a regular expression (regular expressions are a string-matching language) that means: match values in names(dry2) that end with the character '2'.

To create our formula we do:

paste(vars, collapse=" + ")
[1] "RDPI_T2 + DRY_T2"
form = paste("FITNESS_DRY ~ ", paste(vars, collapse=" + "))
form
[1] "FITNESS_DRY ~  RDPI_T2 + DRY_T2"

And now we have our regression formula which we use inside lm.
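
As an aside, base R's reformulate() builds a formula object directly from a character vector of term names, which avoids the nested paste calls. This is a small sketch of that alternative, not what the code above uses; the names form_chr and form_obj are illustrative.

```r
vars <- c("RDPI_T2", "DRY_T2")

# String version, as in the answer: lm() coerces it to a formula
form_chr <- paste("FITNESS_DRY ~", paste(vars, collapse = " + "))

# Formula-object version built by reformulate()
form_obj <- reformulate(vars, response = "FITNESS_DRY")

# Both refer to the same variables
all.vars(as.formula(form_chr))
all.vars(form_obj)
```

Either form can be passed as the first argument to lm together with data = dry2.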

Each iteration (either with for or map or, in @RomanLuštrik's suggestion, mapply), generates the successive models.
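
One way the mapply-style variant might look is sketched below with Map (a thin wrapper around mapply). It walks the two name vectors in parallel, which is the pairing the question asks for. The simulated data and the names dry_vars and rdpi_vars are assumptions for illustration only.

```r
# Simulated stand-in for the real data, with suffixes 1..10 as in the question
set.seed(2)
dat <- as.data.frame(replicate(20, runif(20)))
names(dat) <- paste0(rep(c("DRY_T", "RDPI_T"), each = 10), 1:10)
dat$FITNESS_DRY <- runif(20)

dry_vars  <- paste0("DRY_T", 1:10)
rdpi_vars <- paste0("RDPI_T", 1:10)

# Map pairs the k-th element of dry_vars with the k-th element of rdpi_vars,
# so exactly one model is fitted per pair (10 models, not 10 x 10)
models <- Map(function(d, r) {
  lm(as.formula(sprintf("FITNESS_DRY ~ %s + %s", d, r)), data = dat)
}, dry_vars, rdpi_vars)

# The list is named after the first vector
names(models)
```

Because the pairing is positional, this relies on both name vectors being in the same order.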

UPDATE 2: As I noted in the comment, I realized that the regular expression paste0(i, "$") will fail (by matching more than one of each type of independent variable column) when the column numbers have more than one digit. So, try this instead (and similarly for the map version):

models = list()

for(i in 1:3772) {

  # Get the two column names to use for this iteration of the model
  vars = names(dry2)[names(dry2) %in% paste0(c("DRY_T", "RDPI_T"), i)]

  # Fit the model and add results to the output list
  models[[as.character(i)]] = lm(paste("FITNESS_DRY ~ ", paste(vars, collapse=" + ")),
                                 data = dry2)
}

To see how things go wrong in the original version, run, for example, names(dry2)[grepl("2$", names(dry2))]
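
The same check can be tried without the real data. The sketch below uses a handful of made-up column names (nms is a hypothetical vector, not the actual dry2 names) to compare the loose pattern with exact name matching.

```r
# Made-up column names standing in for names(dry2)
nms <- c("DRY_T2", "DRY_T12", "DRY_T20", "RDPI_T2", "RDPI_T22")

# Loose pattern: anything whose last character is "2"
loose <- nms[grepl("2$", nms)]
loose
#> [1] "DRY_T2"   "DRY_T12"  "RDPI_T2"  "RDPI_T22"

# Exact matching against the two names we actually want
exact <- nms[nms %in% paste0(c("DRY_T", "RDPI_T"), 2)]
exact
#> [1] "DRY_T2"  "RDPI_T2"
```

A regular expression anchored at both ends, such as "^(DRY_T|RDPI_T)2$", would select the same two names as the %in% version.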




Answer 2:


Consider reshaping your very wide data frame to long format with reshape. Long format is usually the preferred data format for practically any data science application.

For your needs, this requires two reshapes, one for each _T metric. After reshaping, create a T_NUM indicator (i.e., the number stripped from DRY_T## and RDPI_T##) and use it, along with the corresponding FITNESS_DRY, to merge the two metrics.

Finally, use by to slice the large data frame by T_NUM groupings and build a list of models. The code below uses the dat3 you simulated above and is all base R: reshape -> T_NUM <- ... -> merge -> by -> lm. The other functions (lapply, within, and Reduce) are helpers for DRY-er code.

# TWO DATA FRAMES OF FOUR COLUMNS
df_list <- lapply(c("DRY_T", "RDPI_T"), function(i)
  within(reshape(dat3[c(grep(i, names(dat3)), ncol(dat3))],
                 varying = list(names(dat3)[grep(i, names(dat3))]),
                 v.names = i,
                 times = names(dat3)[grep(i, names(dat3))],
                 timevar = "T_NUM",
                 direction = "long"), {
           T_NUM <- as.integer(gsub(i, "", as.character(T_NUM)))
           id <- NULL
  })
)

# MERGE BOTH DFs
long_df <- Reduce(function(x, y) merge(x, y, by=c("T_NUM", "FITNESS_DRY")), df_list)

head(long_df, 10)
#    T_NUM FITNESS_DRY     DRY_T     RDPI_T
# 1      1   0.1528837 0.9438393 0.87948274
# 2      1   0.1925344 0.7023740 0.65120186
# 3      1   0.2193480 0.2388948 0.29875871
# 4      1   0.2743660 0.1291590 0.60097630
# 5      1   0.2877732 0.9763985 0.66921847
# 6      1   0.3082835 0.7605133 0.22456361
# 7      1   0.5196165 0.1848823 0.79543965
# 8      1   0.5603618 0.1680519 0.08759412
# 9      1   0.5789254 0.8535485 0.37942053
# 10     1   0.6291315 0.5526741 0.43043940

# NAMED LIST OF 3,772 MODELS
model_list <- by(long_df, long_df$T_NUM, function(sub) 
                  lm(FITNESS_DRY ~ DRY_T + RDPI_T, sub))

Output

summary(model_list$`1`)$coefficients
#               Estimate Std. Error    t value     Pr(>|t|)
# (Intercept)  0.7085512  0.1415849  5.0044269 0.0001085681
# DRY_T       -0.1423601  0.1985256 -0.7170867 0.4830577281
# RDPI_T      -0.1273237  0.2179249 -0.5842551 0.5667218157

summary(model_list$`2`)$coefficients
#              Estimate Std. Error   t value   Pr(>|t|)
# (Intercept) 0.3907525  0.1524423 2.5632809 0.02015115
# DRY_T       0.1952963  0.1990449 0.9811672 0.34026853
# RDPI_T      0.1979513  0.1884085 1.0506492 0.30812662

summary(model_list$`3`)$coefficients
#               Estimate Std. Error  t value   Pr(>|t|)
# (Intercept) 0.38836708  0.2076638 1.870172 0.07878049
# DRY_T       0.06995811  0.1965336 0.355960 0.72624947
# RDPI_T      0.27144752  0.2115787 1.282962 0.21672143
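
As a sanity check on the long-format approach, a model fitted per T_NUM group should give the same estimates as the corresponding wide-format model. The sketch below illustrates this on a small simulated data set. It builds the long table with a plain lapply/rbind rather than reshape, purely to keep the example short; the names wide, long, long_fit and wide_fit are illustrative.

```r
# Small simulated wide data set (assumption: 5 pairs instead of 3772)
set.seed(2)
n_pairs <- 5
wide <- as.data.frame(replicate(2 * n_pairs, runif(20)))
names(wide) <- paste0(rep(c("DRY_T", "RDPI_T"), each = n_pairs), 1:n_pairs)
wide$FITNESS_DRY <- runif(20)

# Stack each DRY_T/RDPI_T pair into long format with a T_NUM indicator
long <- do.call(rbind, lapply(1:n_pairs, function(i) {
  data.frame(T_NUM       = i,
             FITNESS_DRY = wide$FITNESS_DRY,
             DRY_T       = wide[[paste0("DRY_T", i)]],
             RDPI_T      = wide[[paste0("RDPI_T", i)]])
}))

# One model per T_NUM group, as in the answer
model_list <- by(long, long$T_NUM, function(sub)
  lm(FITNESS_DRY ~ DRY_T + RDPI_T, data = sub))

# Compare pair 3 against the model fitted directly on the wide columns
long_fit <- model_list$`3`
wide_fit <- lm(FITNESS_DRY ~ DRY_T3 + RDPI_T3, data = wide)
all.equal(unname(coef(long_fit)), unname(coef(wide_fit)))
```

Only the coefficient names differ between the two fits (DRY_T versus DRY_T3), which is why the comparison drops them with unname.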

...


Source: https://stackoverflow.com/questions/56813352/looping-with-iterations-over-two-lists-of-variables-for-a-multiple-regression-in
