问题
I want to write a loop in R to run multiple regressions with one dependent variables and two lists of independent variables (all continuous variables). The model is additive and the loop should run by iterating through the two lists of variables so that it takes the first column from the first list + the first column from the second list, then the same for the second column in the two lists etc. The problem is I can't get it to iterate through the lists properly, instead my loop runs more models than it should.
The dataframe I am describing here is just a subset I will actually have to run this 3772 times (I am working on RNA-seq transcript expression).
My dataframe is called dry, and contains 22 variables (columns) and 87 observations (rows). Column 1 contains genotypeIDs, column 2:11 contains one set of independent variables to iterate through, column 12:21 contains a second set of independent variables to iterate through, and column 23 contains my dependent variable called FITNESS_DRY. This is what the structure looks like:
str(dry)
'data.frame': 87 obs. of 22 variables:
$ geneID : Factor w/ 87 levels "e10","e101","e102",..: 12 15 17 24 25 30 35 36 38 39 ...
$ RDPI_T1 : num 1.671 -0.983 -0.776 -0.345 0.313 ...
$ RDPI_T2 : num -0.976 -0.774 -0.532 -1.137 1.602 ...
$ RDPI_T3 : num -0.197 -0.324 0.805 -0.701 -0.566 ...
$ RDPI_T4 : num 0.289 -0.92 1.117 -1.214 -0.447 ...
$ RDPI_T5 : num -0.671 1.963 NA -1.024 -0.295 ...
$ RDPI_T6 : num 2.606 -1.116 -0.383 -0.893 0.119 ...
$ RDPI_T7 : num -0.843 -0.229 -0.297 0.504 -0.712 ...
$ RDPI_T8 : num -0.227 NA NA -0.816 -0.761 ...
$ RDPI_T9 : num 0.754 -1.304 1.867 -0.514 -1.377 ...
$ RDPI_T10 : num 1.1352 -0.1028 -0.69 2.0242 -0.0925 ...
$ DRY_T1 : num 0.6636 -0.64508 -0.24643 -1.43231 -0.00855 ...
$ DRY_T2 : num 1.008 0.823 -0.658 -0.148 0.272 ...
$ DRY_T3 : num -0.518 -0.357 1.294 0.408 0.771 ...
$ DRY_T4 : num 0.0723 0.2834 0.5198 1.6527 0.4259 ...
$ DRY_T5 : num 0.1831 1.9984 NA 0.0923 0.1232 ...
$ DRY_T6 : num -1.55 0.366 0.692 0.902 -0.993 ...
$ DRY_T7 : num -2.483 -0.334 -1.077 -1.537 0.393 ...
$ DRY_T8 : num 0.396 NA NA -0.146 -0.468 ...
$ DRY_T9 : num -0.694 0.353 2.384 0.665 0.937 ...
$ DRY_T10 : num -1.24 -1.57 -1.36 -3.88 -1.4 ...
$ FITNESS_DRY: num 1.301 3.365 0.458 0.346 1.983 ...
The goal is to run 10 multiple regressions looking like this:
lm1<-lm(FITNESS_DRY~DRY_T1+RDPI_T1)
lm2<-lm(FITNESS_DRY~DRY_T2+RDPI_T2)
and so forth iterating through all ten columns for both lists This is equivalent to the following in terms of indexing
lm1<-lm(FITNESS_DRY~dry[,12]+dry[,2])
lm1<-lm(FITNESS_DRY~dry[,12]+dry[,2])
etc.
My loop should then calculate summaries for each model, and combine all the pvalues (4th column of the lm summary) in an output object.
I first defined my variable lists
var_list<-list(
var1=dry[,12:21],
var2=dry[,2:11]
)
This is the loop I tried which doesn't work properly:
lm.test1<-name<-vector()
for (i in 12:length(var_list$var1)){
for (j in 2:length(var_list$var2)){
lm.tmp<-lm(FITNESS_DRY~dry[,i]+dry[,j], na.action=na.omit, data=dry)
sum.tmp<-summary(lm.tmp)
lm.test1<-rbind(lm.test1,sum.tmp$coefficients[,4]) }
}
The loop returns this error message:
Warning message:
In rbind(lm.test6, sum.tmp$coefficients[, 4]) :
number of columns of result is not a multiple of vector length (arg 2)
I can call up the object "lm.test1", but that object has 27 lines instead of the 10 I want, so the iterations are not working properly here. Can anyone help with this please? Also, it would be great if I could add the names of my columns for each list of variables into the summary. I have tried using this for each variable list but without succes:
name<-append(name, as.character(colnames(var_list$var1))
Any ideas? Thanks in advance for any help!
UPDATE1: More information on the full data set: My actual data will still contain a first colum "geneID", then I have 3772 columns names DRY_T1....DRY_T3772, then another 3772 columns names RDPI_T1...RDPI_T3772, and finally my dependent variable "FITNESS_DRY". I still want to run all additive models as such:
lm1<-lm(FITNESS_DRY~DRY_T1+RDPI_T1)
lm2<-lm(FITNESS_DRY~DRY_T2+RDPI_T2)
lm3772<-lm(FITNESS_DRY~DRY_T3772+RDPI_T3772)
I simulated a dataset as such:
set.seed(2)
dat3 = as.data.frame(replicate(7544, runif(20)))
names(dat3) = paste0(rep(c("DRY_T","RDPI_T"),each=3772), 1:3772)
dat3 = cbind(dat3, FITNESS_DRY=runif(20))
I then run the for loop:
models = list()
for(i in 1:3772) {
vars = names(dat3)[grepl(paste0(i,"$"), names(dat3))]
models2[[as.character(i)]] = lm(paste("FITNESS_DRY ~ ", paste(vars, collapse="
+")),
data = dat3)
}
This works fine on the data simulation, but when I try it on my real dataset that is set up exactly in the same way it doesn't work. The loop is probably having issues handling numbers with two or more digits. I get this error message:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
UPDATE 2: Indeed the model had issues handling numbers with two or more digits. To see how things go wrong in the original version I used this: (my dataset is called "dry2"):
names(dry2)[grepl("2$", names(dry2))]
This returned all DRY_T and RDPI_T variables with numbers containing "2" instead of just one pair of DRY_T and RDPI_T.
To fix the issue this new code works:
models = list()
for(i in 1:3772) {
vars = names(dry2)[names(dry2) %in% paste0(c("DRY_T", "RDPI_T"), i)]
models[[as.character(i)]] = lm(paste("FITNESS_DRY ~ ", paste(vars, collapse=" + ")),
data = dry2)
}
回答1:
There are a number of ways to set up the model formulas for iteration. Here's one approach, which we demonstrate using a for loop or map
from the purrr
package for the iteration. Then we use tidy
from the broom
package to get the coefficients and p-values.
library(tidyverse)
library(broom)
# Fake data
set.seed(2)
dat = as.data.frame(replicate(20, runif(20)))
names(dat) = paste0(rep(c("DRY_T","RDPI_T"),each=10), 0:9)
dat = cbind(dat, FITNESS_DRY=runif(20))
# Generate list of models
# Using for loop
models = list()
for(i in 0:9) {
# Get the two column names to use for this iteration of the model
vars = names(dat)[grepl(paste0(i,"$"), names(dat))]
# Fit the model and add results to the output list
models[[as.character(i)]] = lm(paste("FITNESS_DRY ~ ", paste(vars, collapse=" + ")),
data = dat)
}
# Same idea using purrr::map to iterate
models = map(0:9 %>% set_names(),
~ {
vars = names(dat)[grepl(paste0(.x,"$"), names(dat))]
form = paste("FITNESS_DRY ~ ", paste(vars, collapse=" + "))
lm(form, data = dat)
})
# Check first two models
models[1:2]
#> $`0`
#>
#> Call:
#> lm(formula = form, data = dat)
#>
#> Coefficients:
#> (Intercept) DRY_T0 RDPI_T0
#> 0.4543 0.3025 -0.1624
#>
#>
#> $`1`
#>
#> Call:
#> lm(formula = form, data = dat)
#>
#> Coefficients:
#> (Intercept) DRY_T1 RDPI_T1
#> 0.64511 -0.33293 0.06698
# Get coefficients and p-values for each model in a single data frame
results = map_df(models, tidy, .id="run_number")
results
#> # A tibble: 30 x 6
#> run_number term estimate std.error statistic p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 0 (Intercept) 0.454 0.153 2.96 0.00872
#> 2 0 DRY_T0 0.303 0.197 1.53 0.143
#> 3 0 RDPI_T0 -0.162 0.186 -0.873 0.395
#> 4 1 (Intercept) 0.645 0.185 3.49 0.00279
#> 5 1 DRY_T1 -0.333 0.204 -1.63 0.122
#> 6 1 RDPI_T1 0.0670 0.236 0.284 0.780
#> 7 2 (Intercept) 0.290 0.147 1.97 0.0650
#> 8 2 DRY_T2 0.270 0.176 1.53 0.144
#> 9 2 RDPI_T2 0.180 0.185 0.972 0.345
#> 10 3 (Intercept) 0.273 0.187 1.46 0.162
#> # … with 20 more rows
Created on 2019-06-28 by the reprex package (v0.2.1)
If you don't need to save the model objects, you can just return the data frame of coefficients and p-values:
results = map_df(0:9 %>% set_names(),
~ {
vars = names(dat)[grepl(paste0(.x,"$"), names(dat))]
form = paste("FITNESS_DRY ~ ", paste(vars, collapse=" + "))
tidy(lm(form, data = dat))
}, .id="run_number")
UPDATE: In answer to your comment, if you replace all instances of 0:9
with 1:10
(sorry, didn't notice that your column suffixes went from 1:10 rather than 0:9), and all instances of dat
(my fake data) with dry2
(or whatever name you're using for your data frame), the code will run with your data, so long as the column names are the same as the ones you used in your question. If you're using different column names, you'll need to adapt the code, either by hard-coding the new names or by creating a function that can accept whatever column names you're using for the model(s) you're generating.
To explain what the code is doing: First, we need to get the names of the columns we want to use in each iteration of the model. For example, in the for-loop version:
vars = names(dry2)[grepl(paste0(i,"$"), names(dry2))]
When, for example, i=2
, this resolves to:
vars = names(dry2)[grepl("2$", names(dry2))]
vars
[1] "RDPI_T2" "DRY_T2"
So those are the two columns we want to use to generate a regression formula. "2$"
is a regular expression (regular expressions is a string matching language) that means: match values in names(dry2)
that end with the number '2'.
To create our formula we do:
paste(vars, collapse=" + ")
[1] "RDPI_T2 + DRY_T2"
form = paste("FITNESS_DRY ~ ", paste(vars, collapse=" + "))
form
[1] "FITNESS_DRY ~ RDPI_T2 + DRY_T2"
And now we have our regression formula which we use inside lm
.
Each iteration (either with for
or map
or, in @RomanLuštrik's suggestion, mapply
), generates the successive models.
UPDATE 2: As I noted in the comment, I realized that the regular expression paste(i, "$")
will fail (by matching more than one of each type of independent variable column) when the final number is more than one digit. So, try this instead (and similarly for the map
version):
models = list()
for(i in 1:3772) {
# Get the two column names to use for this iteration of the model
vars = names(dry2)[names(dry2) %in% paste0(c("DRY_T", "RDPI_T"), i)]
# Fit the model and add results to the output list
models[[as.character(i)]] = lm(paste("FITNESS_DRY ~ ", paste(vars, collapse=" + ")),
data = dry2)
}
To see how things go wrong in the original version, run, for example, names(dry2)[grepl("2$", names(dry2))]
回答2:
Consider reshaping your very wide data frame to long format with reshape
which is usually the preferred data format of practically any data science application.
For your needs, this requires two reshapes for each _T
metric. After reshaping, create a T_NUM
indicator (i.e., stripping the number of DRY_T##
and RDPI_T##
) and use that along with corresponding FITNESS_DRY
to merge
the two metrics.
Finally, use by
to slice your large data frame by T_NUM groupings to build a list of models. Below uses the dat3 you simulated above. Altogether, all with base R: reshape
-> TNUM <- ...
-> merge
-> by
-> lm
. The other methods, lapply
, within
, and Reduce
are helpers for DRY-er code.
# TWO DATA FRAMES OF FOUR COLUMNS
df_list <- lapply(c("DRY_T", "RDPI_T"), function(i)
within(reshape(dat3[c(grep(i, names(dat3)), ncol(dat3))],
varying = list(names(dat3)[grep(i, names(dat3))]),
v.names = i,
times = names(dat3)[grep(i, names(dat3))],
timevar = "T_NUM",
direction = "long"), {
T_NUM <- as.integer(gsub(i, "", as.character(T_NUM)))
id <- NULL
})
)
# MERGE BOTH DFs
long_df <- Reduce(function(x, y) merge(x, y, by=c("T_NUM", "FITNESS_DRY")), df_list)
head(long_df, 10)
# T_NUM FITNESS_DRY DRY_T RDPI_T
# 1 1 0.1528837 0.9438393 0.87948274
# 2 1 0.1925344 0.7023740 0.65120186
# 3 1 0.2193480 0.2388948 0.29875871
# 4 1 0.2743660 0.1291590 0.60097630
# 5 1 0.2877732 0.9763985 0.66921847
# 6 1 0.3082835 0.7605133 0.22456361
# 7 1 0.5196165 0.1848823 0.79543965
# 8 1 0.5603618 0.1680519 0.08759412
# 9 1 0.5789254 0.8535485 0.37942053
# 10 1 0.6291315 0.5526741 0.43043940
# NAMED LIST OF 3,772 MODELS
model_list <- by(long_df, long_df$T_NUM, function(sub)
lm(FITNESS_DRY ~ DRY_T + RDPI_T, sub))
Output
summary(model_list$`1`)$coefficients
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.7085512 0.1415849 5.0044269 0.0001085681
# DRY_T -0.1423601 0.1985256 -0.7170867 0.4830577281
# RDPI_T -0.1273237 0.2179249 -0.5842551 0.5667218157
summary(model_list$`2`)$coefficients
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.3907525 0.1524423 2.5632809 0.02015115
# DRY_T 0.1952963 0.1990449 0.9811672 0.34026853
# RDPI_T 0.1979513 0.1884085 1.0506492 0.30812662
summary(model_list$`3`)$coefficients
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.38836708 0.2076638 1.870172 0.07878049
# DRY_T 0.06995811 0.1965336 0.355960 0.72624947
# RDPI_T 0.27144752 0.2115787 1.282962 0.21672143
...
来源:https://stackoverflow.com/questions/56813352/looping-with-iterations-over-two-lists-of-variables-for-a-multiple-regression-in