Rolling regression and prediction with lm() and predict()

Asked by 旧时难觅i, 2021-01-06 17:39

I need to apply lm() to an enlarging subset of my dataframe dat, while making a prediction for the next observation. For example, I am doing:
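
The original code snippet is missing from this copy of the question. A plausible reconstruction of the loop being described, borrowing the variable names and model formula from the answer below (a sketch, not the asker's actual code):

    ## grow the fitting window one row at a time, predict the next row
    preds <- vector("list", nrow(dat) - 1)
    for (i in 3:(nrow(dat) - 1)) {
      fit <- lm(log(clicks) ~ log(v1) + log(v12), data = dat[1:i, ])
      preds[[i]] <- predict(fit, newdata = dat[i + 1, ], se.fit = TRUE)
    }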

2 Answers
  •  庸人自扰
    2021-01-06 17:41

    (Efficient) solution

    This is what you can do:

    p <- 3  ## number of parameters in lm()
    n <- nrow(dat) - 1
    
    ## a function to return what you desire for subset dat[1:x, ]
    bundle <- function(x) {
      fit <- lm(log(clicks) ~ log(v1) + log(v12), data = dat, subset = 1:x, model = FALSE)
      pred <- predict(fit, newdata = dat[x+1, ], se.fit = TRUE)
      c(summary(fit)$adj.r.squared, pred$fit, pred$se.fit)
    }
    
    ## rolling regression / prediction
    result <- t(sapply(p:n, bundle))
    colnames(result) <- c("adj.r2", "prediction", "se")
    

    Note that I have done several things inside the bundle function:

    • I have used the subset argument to select the rows used for fitting;
    • I have used model = FALSE so that the model frame is not stored in the fitted object, which saves memory (see the quick check below).
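
    A quick check that model = FALSE really does shrink the fitted object (a sketch, not part of the original answer; predict() still works on the lean fit because it rebuilds the frame from the stored terms):

    fit_full <- lm(log(clicks) ~ log(v1) + log(v12), data = dat)
    fit_lean <- lm(log(clicks) ~ log(v1) + log(v12), data = dat, model = FALSE)
    object.size(fit_full) > object.size(fit_lean)  ## TRUE: no model frame stored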

    Overall, there is no explicit loop; sapply handles the iteration.

    • Fitting starts at p, the minimum number of observations required to fit a model with p coefficients (a sketch for deriving p from the formula follows this list);
    • Fitting stops at nrow(dat) - 1, as the final row must be held out for prediction.
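
    If you would rather not hard-code p, it can be derived from the model formula by counting the columns of the design matrix. A minimal sketch (an addition, not part of the original answer):

    f <- log(clicks) ~ log(v1) + log(v12)
    p <- ncol(model.matrix(f, data = dat))  ## 3: intercept + two slopes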

    Test

    Example data (with 30 "observations")

    dat <- data.frame(clicks = runif(30, 1, 100), v1 = runif(30, 1, 100),
                      v12 = runif(30, 1, 100))
    
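    Since runif() draws random numbers and the answer records no seed, a fresh run will not reproduce the exact table below. To make your own run reproducible, set a seed (any value; this one is arbitrary) before creating dat:

    set.seed(0)  ## arbitrary seed, run before the data.frame() call above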

    Applying the code above gives the following result (27 rows in total; output truncated to the first 5 rows):

                adj.r2 prediction        se
     [1,]          NaN   3.881068       NaN
     [2,]  0.106592619   3.676821 0.7517040
     [3,]  0.545993989   3.892931 0.2758347
     [4,]  0.622612495   3.766101 0.1508270
     [5,]  0.180462206   3.996344 0.2059014
    

    The first column is the adjusted R-squared of the fitted model, while the second column is the prediction. The first value of adj.r2 is NaN because the first model fits 3 coefficients to 3 data points, leaving zero residual degrees of freedom, so no sensible statistic is available. The same happens to se: the fit passes through all 3 points exactly (all residuals are zero), so the residual standard error is undefined and the prediction standard error comes out as NaN.
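
    To see this edge case in isolation, here is a minimal check (a sketch reusing dat from above, not part of the original answer):

    ## saturated fit: 3 coefficients estimated from exactly 3 rows
    fit3 <- lm(log(clicks) ~ log(v1) + log(v12), data = dat, subset = 1:3, model = FALSE)
    df.residual(fit3)  ## 0, so the residual variance cannot be estimated
    summary(fit3)$adj.r.squared  ## NaN
    predict(fit3, newdata = dat[4, ], se.fit = TRUE)$se.fit  ## NaN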
