Generate an array of regression models without for loop

Submitted by 让人想犯罪 on 2019-12-13 11:22:20

Question


I have a data set with columns Y, X1, X2 and V. Y, X1 and X2 are continuous, while V is a categorical variable. Assuming V has 10 categories, I want to fit 10 linear regression models, one per category, and store the results (coefficients, p-values, R-squared, etc.) in another table. Is there a way to do it with data.table without using for loops? Thanks.


Answer 1:


The base R function by is what you want.

# make up some sample data
dataSet <- data.frame(Y = iris$Sepal.Length, 
                      X1 = iris$Sepal.Width, 
                      X2 = iris$Petal.Length, 
                      V = iris$Species)
# apply the `lm` function by the value of `V`
by(data = dataSet[c("Y","X1","X2")], 
   INDICES = dataSet$V, 
   FUN = lm, 
   formula = Y ~ .)

In the by function, data is the data you want to apply the function to. INDICES is a factor, or a list of factors, with one value for each row of data, indicating how you want the data split up. FUN is the function you want applied to each subset of your data. In this case, lm() needs the extra parameter formula to say how you want to model your data, so you can simply pass formula as an extra argument to by, which forwards it to lm().
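The object that by returns behaves like a list with one fitted model per level of V, so the results can be collected afterwards. The following is a minimal sketch that reuses the iris-based sample data from above; the names fits and coefTable are illustrative and not part of the original answer.

```r
# Rebuild the sample data from the answer above
dataSet <- data.frame(Y  = iris$Sepal.Length,
                      X1 = iris$Sepal.Width,
                      X2 = iris$Petal.Length,
                      V  = iris$Species)

# One lm fit per level of V; the result is a list-like "by" object
fits <- by(data = dataSet[c("Y", "X1", "X2")],
           INDICES = dataSet$V,
           FUN = lm,
           formula = Y ~ .)

# Inspect a single model, here the fit for the first level of V
summary(fits[[1]])

# Stack the coefficient vectors into a matrix with one row per level
coefTable <- do.call(rbind, lapply(fits, coef))
coefTable
```

Each row of coefTable holds the intercept and the two slopes for one category of V.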




Answer 2:


The broom package exists exactly for this type of problem. It 'tidies' the output of models into neat data frames for easy storage and comparison. Here is an example that uses broom and dplyr to solve a nearly identical problem. It uses dplyr to group the data by a categorical variable, fits a model to each group, and extracts the coefficients into a data.frame in just a few lines of code. I am unfamiliar with data.table's grouped operations, but it may be possible to do something similar with that package.

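Additionally, broom has the glance function, which returns goodness-of-fit metrics and other model-level summary statistics as a one-row data frame, and the augment function, which adds per-observation values such as fitted values and residuals to the data.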
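The broom and dplyr approach could look roughly like the sketch below. This is not the example the answer refers to; it assumes the iris-based sample data from the first answer, that both packages are installed, and a dplyr version that provides group_modify. The names coefTable and fitTable are illustrative.

```r
library(dplyr)
library(broom)

# Same sample data as in the first answer
dataSet <- data.frame(Y  = iris$Sepal.Length,
                      X1 = iris$Sepal.Width,
                      X2 = iris$Petal.Length,
                      V  = iris$Species)

# One row per coefficient per group:
# term, estimate, std.error, statistic, p.value
coefTable <- dataSet %>%
  group_by(V) %>%
  group_modify(~ tidy(lm(Y ~ X1 + X2, data = .x)))

# One row per group with model-level statistics such as r.squared
fitTable <- dataSet %>%
  group_by(V) %>%
  group_modify(~ glance(lm(Y ~ X1 + X2, data = .x)))

coefTable
fitTable
```

Both results are ordinary grouped data frames, so they can be stored or joined on V like any other table.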

Alternatively, if you want to do it without installing additional packages, you could split your data frame into a list (using the split function), lapply the modeling process over the list, extract the results (probably through another lapply that pulls the information out of each lm object), and then rbind it all together.
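The split, lapply, and rbind steps described above might be sketched as follows in base R. It again assumes the iris-based sample data from the first answer; the helper extractResults and the other variable names are hypothetical, chosen only for illustration.

```r
# Same sample data as in the first answer
dataSet <- data.frame(Y  = iris$Sepal.Length,
                      X1 = iris$Sepal.Width,
                      X2 = iris$Petal.Length,
                      V  = iris$Species)

# 1. Split the data frame into a list with one element per level of V
groups <- split(dataSet, dataSet$V)

# 2. Fit the same model to every group
models <- lapply(groups, function(d) lm(Y ~ X1 + X2, data = d))

# 3. Turn each fit into a small data frame, one row per coefficient
extractResults <- function(level, fit) {
  s   <- summary(fit)
  est <- coef(s)  # matrix: Estimate, Std. Error, t value, Pr(>|t|)
  data.frame(V         = level,
             term      = rownames(est),
             estimate  = est[, "Estimate"],
             p.value   = est[, "Pr(>|t|)"],
             r.squared = s$r.squared,
             row.names = NULL)
}
pieces <- Map(extractResults, names(models), models)

# 4. Bind the per-group pieces into a single table
resultTable <- do.call(rbind, pieces)
rownames(resultTable) <- NULL
resultTable
```

Map is used instead of a second lapply only so that the level name can be passed alongside each model; the overall pattern is the one the answer describes.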



Source: https://stackoverflow.com/questions/39498490/generate-an-array-of-regression-models-without-for-loop
