Analysis using linear regression based on subgroups

ε祈祈猫儿з 提交于 2019-11-28 09:52:11

问题


Assume I have data (t,y), where I expect a linear dependency y(t). Furthermore, there exist attributes to each observation par1, par2, par3. Is there an algorithm or technique to decide, if (one or both or all of the parameters) are relevant for the fit or not? I tried leaps::regsubsets(y ~ t + par1 + par2 + par3, data = mydata, nbest = 10) but was not able to get the formula for the best fit.

The final result should look like this if plotted. For data see below.


Thus, I want the information

  • Adding par1and par2 gives the best fit
  • The models are y_i = a_i * t_i + b_i with given a_i and b_i

Reproducible example:

t <- seq(0,10, length.out = 1000) # large sample of x values
# Create 3 linear equations of the form y_i = a*t_i + b
a <- c(1, 0.3, 0.2) # slope
b <- c(-0.5, 0.5, 0.1) # offset

# create t_i, y_ti and y_i (including noise)
d <- list()
y <- list()
y_t <- list()
for (i in 1:3) {
  set.seed(33*i)
  d[[i]] <- sort(sample(t, 50, replace = F))
  set.seed(33*i)
  noise <- rnorm(10)
  y[[i]] <- a[i]*d[[i]] + b[i] + noise
  y_t[[i]] <- a[i]*d[[i]] + b[i]
}
# Final data set
df1 <- data.frame(t=d[[1]], y=y[[1]], par1=rep(1), par2=rep(10), par3=sample(c(100, 200, 300), length(d[[1]]), replace = T))
df2 <- data.frame(t=d[[2]], y=y[[2]], par1=rep(2), par2=rep(20), par3=sample(c(100, 200, 300), length(d[[1]]), replace = T))
df3 <- data.frame(t=d[[3]], y=y[[3]], par1=rep(2), par2=rep(30), par3=sample(c(100, 200, 300), length(d[[1]]), replace = T))
mydata <- rbind(df1, df2, df3)
mydata <- mydata[sample(nrow(mydata)), ]

# That is what the data is looking like:
plot(mydata$t, mydata$y)

# This is the result I am looking for (ideally):
plot(d[[1]], y[[1]], col = "black", xlim = c(0, 10), ylim = c(-2, 10), xlab = "t", ylab = "y",
     main = "Fit for three different groups")
points(d[[2]], y[[2]], col = "red")
points(d[[3]], y[[3]], col = "blue")
lines(d[[1]], y_t[[1]],col = "black")
lines(d[[2]], y_t[[2]], col = "red")
lines(d[[3]], y_t[[3]], col = "blue")

Comment and question on @Roland's answer:

I understand that that with the given three parameters there are 2^3=8 groups with 2*3*3=18 factor levels. But I would expect that we only have 8 relevant groups as I always have the choice between "include parameter x or not". To me it does not make sense to only "include level x of parameter y".

I tried the following

g <- 0
t_lin1 <- mydata$t[mydata$g == g]
y_lin1 <- mydata$y[mydata$g == g]
plot(mydata$t, mydata$y)
points(t_lin1, y_lin1, col = "red")
abline(lm(y_lin1 ~ t_lin1), col = "red")
points(pred.1se ~ t, data = mydata, col = as.integer(mydata$g), pch = 16)

and realized that the fit is off. Looking back this is clear because

  • I include the wrong factor levels (most likely parameter 3 is not relevant)
  • and thus get the wrong data for the fit

So my last question is:

  • Where can I find the relevant groups included in the best model and
  • what are the corresponding fit parameters from the regression?

Sorry, if this was obvious but to me it is mystery


回答1:


The LASSO can come pretty close (although it identifies still too many effects):

#I assume these are supposed to be factors:
mydata$par1 <- factor(mydata$par1)
mydata$par2 <- factor(mydata$par2)
mydata$par3 <- factor(mydata$par3)

#create model matrix, remove intercept since glmnet adds it
x <- model.matrix(y ~ (par1 * par2 * par3) * t, data = mydata)[,-1]

#cross-validated LASSO
library(glmnet)
set.seed(42)
fit <- cv.glmnet(x, mydata$y, intercept = TRUE, nfolds = 10, alpha = 1)
plot(fit)

coef <- as.matrix(coef(fit, s = "lambda.1se"))
coef[coef != 0,]
#(Intercept)      par230           t     par12:t    par230:t   par3300:t 
# 0.47542479 -0.27612966  0.75497711 -0.42493030 -0.15044371  0.03033057

#The groups:
mydata$g <- factor((mydata$par2 == 30) + 10 * (mydata$par1 == 2) + 100 * (mydata$par3 == 300))



mydata$pred.1se <- predict(fit, newx = x, s = "lambda.1se")

library(ggplot2)
ggplot(mydata, aes(x = t, color = g)) +
  geom_point(aes(y = y)) +
  geom_line(aes(y = pred.1se))

You can then calculate the desired intercepts and slopes from the coefficients.



来源:https://stackoverflow.com/questions/40765869/analysis-using-linear-regression-based-on-subgroups

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!