问题
I have a huge dataset and would like to perform a rolling linear regression over a window of 60. However, I want that only the 60 previous values are considered for the linear regression.
My Dataframe DF consists of following Columns:
Date Company Y X1 X2
01.01.2015 Mill 0.13 -1 -3
01.02.2015 Mill 0.16 1 5
01.03.2015 Mill 0.83 3 4
01.04.2015 Mill -0.83 23 4
01.01.1988 Hall 0.23 1 3
01.02.1988 Hall 0.24 23 2
01.03.1988 Hall 0.78 19 -9
01.04.1988 Hall 0.73 4 12
01.05.1988 Hall 0.72 5 12
01.11.2008 Jopo 0.12 0.9 32
01.12.2008 Jopo 0.13 10 32
01.01.2009 Jopo 0.32 0.2 10
01.02.2009 Jopo 0.32 2 -1
I have several thousand companies and data for several months for each company. The regression has to be done for every month of a company, with the rolling window of 60 previous months of this specific company.
In the given example, assuming only a rolling window of 3, I want for company Mill a regression on 01.04.2015 with the data from 01.01-01.03-2015. For company Hall I want regressions on 01.04 and 01.05.1988, and for Jopo I want a regression on 01.02.2009.
Ideally, the results will be pasted together with Company and Date in a new data frame, as I have to keep working with this data and have to analyse it more.
Following code should do the trick for the rolling regression, however it does not use the previous 60 dates, but 59 and includes the current date too:
library(zoo)
rolled <- function(df) {
rollapply(df, width = 60,
FUN = function(z) coef(lm(Y ~ X1+X2, data = as.data.frame(z))),
by.column = FALSE, align = "right"
)
}
Following code does the regression dependent on the company name, as I want to make regressions for each individual company, independend from the other companies.
Test <- do.call("rbind", by(DF[c("Y", "X1", "X2")], DF[c( "Name")], rolled))
How do I incorporate, that only the 60 previous values are used for the regression? And maybe someone knows how to show also "Company" and "Date" in the results? Thanks for your help!
回答1:
Assume DF
is as given reproducibly in the Note at the end. Use by
to split DF
into company rows and apply the anonymous function using rollapplyr
. Note that rollapplyr
can take for the width
a list argument with the offsets of the positions to use. For example, list(-seq(3))
means use the 3 prior rows (as suggested in the question) but not the current row (which would have position 0).
library(zoo)
# w <- 60
w <- 3
Coef <- function(x) coef(lm(as.data.frame(x)))
do.call("rbind", by(DF, DF$Company, function(x)
cbind(x, rollapplyr(x[3:5], list(-seq(w)), Coef, fill = NA, by.column = FALSE))))
giving:
Date Company Y X1 X2 (Intercept) X1 X2
Hall.5 01.01.1988 Hall 0.23 1.0 3 NA NA NA
Hall.6 01.02.1988 Hall 0.24 23.0 2 NA NA NA
Hall.7 01.03.1988 Hall 0.78 19.0 -9 NA NA NA
Hall.8 01.04.1988 Hall 0.73 4.0 12 0.37711 -0.0017480 -0.0484553
Hall.9 01.05.1988 Hall 0.72 5.0 12 1.30333 -0.0433333 -0.0333333
Jopo.10 01.11.2008 Jopo 0.12 0.9 32 NA NA NA
Jopo.11 01.12.2008 Jopo 0.13 10.0 32 NA NA NA
Jopo.12 01.01.2009 Jopo 0.32 0.2 10 NA NA NA
Jopo.13 01.02.2009 Jopo 0.32 2.0 -1 0.41104 0.0010989 -0.0091259
Mill.1 01.01.2015 Mill 0.13 -1.0 -3 NA NA NA
Mill.2 01.02.2015 Mill 0.16 1.0 5 NA NA NA
Mill.3 01.03.2015 Mill 0.83 3.0 4 NA NA NA
Mill.4 01.04.2015 Mill -0.83 23.0 4 0.21611 0.2994444 -0.0711111
You could also try this:
library(broom)
fun <- function(x) unlist(tidy(lm(as.data.frame(x)))[, -1])
do.call("rbind", by(DF, DF$Company, function(x)
cbind(x, rollapplyr(x[3:5], list(-(seq(w))), fun, fill = NA, by.column = FALSE))))
which gives:
Date Company Y X1 X2 estimate1 estimate2 estimate3
Hall.5 01.01.1988 Hall 0.23 1.0 3 NA NA NA
Hall.6 01.02.1988 Hall 0.24 23.0 2 NA NA NA
Hall.7 01.03.1988 Hall 0.78 19.0 -9 NA NA NA
Hall.8 01.04.1988 Hall 0.73 4.0 12 0.3771138 -0.001747967 -0.048455285
Hall.9 01.05.1988 Hall 0.72 5.0 12 1.3033333 -0.043333333 -0.033333333
Jopo.10 01.11.2008 Jopo 0.12 0.9 32 NA NA NA
Jopo.11 01.12.2008 Jopo 0.13 10.0 32 NA NA NA
Jopo.12 01.01.2009 Jopo 0.32 0.2 10 NA NA NA
Jopo.13 01.02.2009 Jopo 0.32 2.0 -1 0.4110390 0.001098901 -0.009125874
Mill.1 01.01.2015 Mill 0.13 -1.0 -3 NA NA NA
Mill.2 01.02.2015 Mill 0.16 1.0 5 NA NA NA
Mill.3 01.03.2015 Mill 0.83 3.0 4 NA NA NA
Mill.4 01.04.2015 Mill -0.83 23.0 4 0.2161111 0.299444444 -0.071111111
std.error1 std.error2 std.error3 statistic1 statistic2 statistic3
Hall.5 NA NA NA NA NA NA
Hall.6 NA NA NA NA NA NA
Hall.7 NA NA NA NA NA NA
Hall.8 NaN NaN NaN NaN NaN NaN
Hall.9 NaN NaN NaN NaN NaN NaN
Jopo.10 NA NA NA NA NA NA
Jopo.11 NA NA NA NA NA NA
Jopo.12 NA NA NA NA NA NA
Jopo.13 NaN NaN NaN NaN NaN NaN
Mill.1 NA NA NA NA NA NA
Mill.2 NA NA NA NA NA NA
Mill.3 NA NA NA NA NA NA
Mill.4 NaN NaN NaN NaN NaN NaN
p.value1 p.value2 p.value3
Hall.5 NA NA NA
Hall.6 NA NA NA
Hall.7 NA NA NA
Hall.8 NaN NaN NaN
Hall.9 NaN NaN NaN
Jopo.10 NA NA NA
Jopo.11 NA NA NA
Jopo.12 NA NA NA
Jopo.13 NaN NaN NaN
Mill.1 NA NA NA
Mill.2 NA NA NA
Mill.3 NA NA NA
Mill.4 NaN NaN NaN
>
Alternative
Another possibility is to use a width of w+1
and then remove the last component.
# w <- 60
w <- 3
Coef1 <- function(x) coef(lm(as.data.frame(head(x, -1))))
do.call("rbind", by(DF, DF$Company, function(x)
cbind(x, rollapplyr(x[3:5], w+1, Coef1, fill = NA, by.column = FALSE))))
Fewer than w+1 rows in a company
If there are companies with fewer than w+1 rows then try this. It uses the partial=TRUE
argument of rollapplyr
to compute lm
with fewer rows and modifies Coef
accordingly so that it will continue to work:
# w <- 60
w <- 3
Coef <- function(x) coef(lm(as.data.frame(matrix(x, c(nrow(x), 1)))))
do.call("rbind", by(DF, DF$Company, function(x) cbind(x,
rollapplyr(x[3:5], list(-seq(w)), Coef, partial = TRUE, by.column = FALSE))))
Note: Input DF
is:
Lines <- "Date Company Y X1 X2
01.01.2015 Mill 0.13 -1 -3
01.02.2015 Mill 0.16 1 5
01.03.2015 Mill 0.83 3 4
01.04.2015 Mill -0.83 23 4
01.01.1988 Hall 0.23 1 3
01.02.1988 Hall 0.24 23 2
01.03.1988 Hall 0.78 19 -9
01.04.1988 Hall 0.73 4 12
01.05.1988 Hall 0.72 5 12
01.11.2008 Jopo 0.12 0.9 32
01.12.2008 Jopo 0.13 10 32
01.01.2009 Jopo 0.32 0.2 10
01.02.2009 Jopo 0.32 2 -1"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
来源:https://stackoverflow.com/questions/42584748/linear-regression-with-only-previous-values-in-moving-window