Question
Someone recently posted a question on this paper here: https://static.googleusercontent.com/media/www.google.com/en//googleblogs/pdfs/google_predicting_the_present.pdf
The R code can be found at the very end of the paper. Essentially, the paper investigates one-month-ahead predictions of sales using search queries. I think I understand the model and method, but one detail puzzles me. It's this part:
1 ##### Divide data by two parts - model fitting & prediction
dat1 = mdat[1:(nrow(mdat)-1), ]
dat2 = mdat[nrow(mdat), ]
2 ##### Fit Model;
fit = lm(log(sales) ~ log(s1) + log(s12) + trends1, data=dat1);
summary(fit)
and:
3 #### Prediction for the next month;
predict.fit = predict(fit, newdata=dat2, se.fit=TRUE);
I do understand that dat2 in (1) is only the last row of mdat. (2) means that the regression model is fitted to everything but the last row of the dataset.
But why is newdata=dat2 used in the prediction in (3), and what does it mean? Why the last row only?
Answer 1:
Here is a description for each line of the code:
dat1 = mdat[1:(nrow(mdat)-1), ]
Creates a subset of the whole dataset which contains all but the last row.
dat2 = mdat[nrow(mdat), ]
Creates a subset of the whole dataset which contains only the last row.
fit = lm(log(sales) ~ log(s1) + log(s12) + trends1, data=dat1)
Only the first subset, dat1, is used for fitting the model, i.e. the data without the last row.
predict.fit = predict(fit, newdata=dat2, se.fit=TRUE)
predict takes the fitted model and computes what it would predict for the "unseen" data dat2.
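To make the role of newdata concrete, here is a minimal sketch. The data frame below is simulated and only stands in for mdat; the column values and coefficients are assumptions for illustration, not the paper's data. It shows that predict without newdata returns fitted values for the training rows, while newdata=dat2 returns a prediction for the held-out row only.

```r
# Hypothetical data standing in for mdat; values are simulated, not from the paper.
set.seed(1)
n <- 24
mdat <- data.frame(
  s1      = runif(n, 90, 110),  # stand-in for sales lagged by 1 month
  s12     = runif(n, 90, 110),  # stand-in for sales lagged by 12 months
  trends1 = rnorm(n)            # stand-in for the search-query index
)
mdat$sales <- exp(0.5 * log(mdat$s1) + 0.4 * log(mdat$s12) +
                  0.05 * mdat$trends1 + rnorm(n, sd = 0.01))

dat1 <- mdat[1:(nrow(mdat) - 1), ]  # rows used for fitting
dat2 <- mdat[nrow(mdat), ]          # the single held-out row

fit <- lm(log(sales) ~ log(s1) + log(s12) + trends1, data = dat1)

# Without newdata: fitted values for the rows the model was trained on.
in_sample <- predict(fit)
length(in_sample)                   # one value per row of dat1

# With newdata: a prediction for the row(s) of dat2 only.
pred <- predict(fit, newdata = dat2, se.fit = TRUE)
pred$fit                            # predicted log(sales) for the held-out row
exp(pred$fit)                       # naive back-transformation to the sales scale
```

Because the response in the formula is log(sales), the prediction comes back on the log scale, so it has to be exponentiated before it can be compared with the actual sales figure in dat2.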
In the simplest case, with only one independent variable, we would fit a line to dat1 and then look up which Y-value that line predicts for the X-value of dat2.
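The one-variable case can be sketched as follows. The data frames train and new are made up for illustration and play the roles of dat1 and dat2; the check at the end compares predict with computing intercept + slope * x by hand.

```r
# Hypothetical one-variable data; the numbers are made up for illustration.
train <- data.frame(x = c(1, 2, 3, 4, 5),
                    y = c(2.1, 3.9, 6.2, 7.8, 10.1))
new   <- data.frame(x = 6)           # the "unseen" row, playing the role of dat2

line_fit <- lm(y ~ x, data = train)  # fit the line on the training rows only

# predict() plugs the new x into the fitted line ...
by_predict <- predict(line_fit, newdata = new)

# ... which is the same as intercept + slope * x computed by hand.
b <- coef(line_fit)
by_hand <- b[["(Intercept)"]] + b[["x"]] * new$x

all.equal(unname(by_predict), by_hand)
```

Note that newdata must contain columns with the same names as the predictors in the formula (here x), otherwise predict cannot evaluate the model on it.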
Source: https://stackoverflow.com/questions/38036874/predict-and-newdata-how-does-this-work