问题
Can anyone suggest a dplyr
answer to the following question?
Split data.frame by country, and create linear regression model on each subset
For completeness, the question and answer from the link are included below.
Question
For reference, here's Josh's question:
I have a data.frame of data from the World Bank which looks something like this;
country date BirthRate US.
4 Aruba 2011 10.584 25354.8
5 Aruba 2010 10.804 24289.1
6 Aruba 2009 11.060 24639.9
7 Aruba 2008 11.346 27549.3
8 Aruba 2007 11.653 25921.3
9 Aruba 2006 11.977 24015.4
All in all there 70 something sub sets of countries in this data frame that I would like to run a linear regression on. If I use the following I get a nice lm for a single country;
andora = subset(high.sub, country == "Andorra")
andora.lm = lm(BirthRate~US., data = andora)
anova(andora.lm)
summary(andora.lm)
But when I try to use the same type of code in a for loop, I get an error which I'll print below the code;
high.sub = subset(highInc, date > 1999 & date < 2012)
high.sub <- na.omit(high.sub)
highnames <- unique(high.sub$country)
for (i in highnames) {
linmod <- lm(BirthRate~US., data = high.sub, subset = (country == "[i]"))
}
#Error message:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
If I can get this loop to run I would ideally like to append the coefficients and even better the r-squared values for each model to an empty data.frame. Any help would be greatly appreciated.
Answer
For reference, here's jlhoward's answer (incorporating BondedDust's comment) making use of the *apply functions found in this excellent question: R Grouping functions: sapply vs. lapply vs. apply. vs. tapply vs. by vs. aggregate
models <- sapply(unique(as.character(df$country)),
function(cntry)lm(BirthRate~US.,df,subset=(country==cntry)),
simplify=FALSE,USE.NAMES=TRUE)
# to summarize all the models
lapply(models,summary)
# to run anova on all the models
lapply(models,anova)
#This produces a named list of models, so you could extract the model for Aruba as:
models[["Aruba"]]
回答1:
Returning a list from dplyr
is not possible yet. If you just need the intercept and slope @jazzurro 's answer is the way, but if you need the whole model you need to do something like
library(dplyr)
models <- df %>% group_by(country) %>% do(mod = lm(BirthRate ~ US., data = .))
Then if you want to perform ANOVA on each fitted model, you can do it using rowwise
models %>% rowwise %>% do(anova(.$mod))
but again the result is coerced to a data frame and is not quite the same as doing lapply(models$mod, anova)
.
For now (ie until the next version of dplyr
) if you need to store the whole result in a list, you can just use dlply
from plyr
, like plyr::dlply(df, "country", function(d) anova(lm(BirthRate ~ US., data = d)))
, or of course if you do not absolutely have to use dplyr
you can go for @SvenHohenstein 's answer which looks like a better way of doing this anyway.
回答2:
Have a look at the lmList
function of the nlme
package:
library(nlme)
lmList(BirthRate ~ US. | country, df)
Here, | country
is used to create a regression for each individual country.
回答3:
Here is one way to use dplyr
. I created Brazil
using your data. So you have identical results for the two countries. You see intercept and slope in the final data frame.
DATA & CODE
structure(list(country = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Aruba", "Brazil"), class = "factor"),
date = c(2011L, 2010L, 2009L, 2008L, 2007L, 2006L, 2011L,
2010L, 2009L, 2008L, 2007L, 2006L), BirthRate = c(10.584,
10.804, 11.06, 11.346, 11.653, 11.977, 10.584, 10.804, 11.06,
11.346, 11.653, 11.977), US. = c(25354.8, 24289.1, 24639.9,
27549.3, 25921.3, 24015.4, 25354.8, 24289.1, 24639.9, 27549.3,
25921.3, 24015.4)), .Names = c("country", "date", "BirthRate",
"US."), class = "data.frame", row.names = c("4", "5", "6", "7",
"8", "9", "10", "11", "12", "13", "14", "15"))
# country date BirthRate US.
#4 Aruba 2011 10.584 25354.8
#5 Aruba 2010 10.804 24289.1
#6 Aruba 2009 11.060 24639.9
#7 Aruba 2008 11.346 27549.3
#8 Aruba 2007 11.653 25921.3
#9 Aruba 2006 11.977 24015.4
#10 Brazil 2011 10.584 25354.8
#11 Brazil 2010 10.804 24289.1
#12 Brazil 2009 11.060 24639.9
#13 Brazil 2008 11.346 27549.3
#14 Brazil 2007 11.653 25921.3
#15 Brazil 2006 11.977 24015.4
group_by(mydf, country) %>%
do({model = lm(BirthRate ~ US., data = .);
data.frame(int = coef(model)[1], slope = coef(model)[2])})
# country int slope
#1 Aruba 11.02503 8.393295e-06
#2 Brazil 11.02503 8.393295e-06
来源:https://stackoverflow.com/questions/27274523/dplyr-version-of-grouping-a-dataframe-then-creating-regression-model-on-each-gro