predict() and newdata - How does this work?

Question

Someone recently posted a question about this paper: https://static.googleusercontent.com/media/www.google.com/en//googleblogs/pdfs/google_predicting_the_present.pdf

The R code is at the very end of the paper. Essentially, the paper investigates one-month-ahead predictions of sales using search query data. I think I understand the model and method, but one detail puzzles me. It's this part:

1 ##### Divide data by two parts - model fitting & prediction
dat1 = mdat[1:(nrow(mdat)-1), ]
dat2 = mdat[nrow(mdat), ]

2 ##### Fit Model;
fit = lm(log(sales) ~ log(s1) + log(s12) + trends1, data=dat1);
summary(fit)

and:

3 #### Prediction for the next month;
predict.fit = predict(fit, newdata=dat2, se.fit=TRUE);

I understand that dat2 in (1) is only the last row of mdat, and that (2) fits the regression model to everything but the last row of the dataset.

But why is newdata=dat2 used in the prediction step (3), and what does it mean? Why only the last row?

Answer 1:

Here is a description of each line of the code:

dat1 = mdat[1:(nrow(mdat)-1), ]

Creates a subset of the whole dataset which contains all but the last row.

dat2 = mdat[nrow(mdat), ]

Creates a subset of the whole dataset which contains only the last row.
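The two subsetting steps can be tried on a tiny data frame. This is only a sketch: `demo`, `head_rows`, and `last_row` are made-up names and values, not the paper's data. It also shows that the single-row subset is still a data frame, which is what `predict()` expects for `newdata`.

```r
# Made-up data for illustration only (not the paper's mdat)
demo <- data.frame(month = 1:5, sales = c(10, 12, 11, 15, 14))

# All rows except the last one, same pattern as dat1
head_rows <- demo[1:(nrow(demo) - 1), ]

# Only the last row, same pattern as dat2
last_row <- demo[nrow(demo), ]

nrow(head_rows)          # number of rows kept for fitting
is.data.frame(last_row)  # a one-row data frame, not a plain vector
```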

fit = lm(log(sales) ~ log(s1) + log(s12) + trends1, data=dat1)

Only the first subset, dat1, is used to fit the model, i.e. the data without the last row.

predict.fit = predict(fit, newdata=dat2, se.fit=TRUE)

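predict takes the fitted model and computes what it would predict for the "unseen" data dat2. Because the paper is about one-month-ahead prediction, the held-out last row plays the role of the next month.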
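To see what the `predict()` call returns, here is a minimal sketch on simulated data. The data frame `mdat_demo` and every number in it are invented for illustration; only the column names and the model formula are taken from the question. `newdata` needs the predictor columns named in the formula (`s1`, `s12`, `trends1`), and `predict()` applies the `log()` transformations from the formula itself.

```r
set.seed(1)
n <- 30

# Simulated stand-in for mdat (assumption: values are made up)
mdat_demo <- data.frame(
  s1      = runif(n, 50, 100),
  s12     = runif(n, 50, 100),
  trends1 = rnorm(n)
)
mdat_demo$sales <- exp(
  0.5 * log(mdat_demo$s1) + 0.4 * log(mdat_demo$s12) +
    0.1 * mdat_demo$trends1 + rnorm(n, sd = 0.05)
)

dat1_demo <- mdat_demo[1:(n - 1), ]  # rows used for fitting
dat2_demo <- mdat_demo[n, ]          # held-out last row

fit_demo <- lm(log(sales) ~ log(s1) + log(s12) + trends1, data = dat1_demo)

# With se.fit = TRUE the result is a list rather than a bare number
out <- predict(fit_demo, newdata = dat2_demo, se.fit = TRUE)

names(out)    # components of the returned list
out$fit       # predicted log(sales) for the held-out row
out$se.fit    # standard error of that predicted mean
exp(out$fit)  # undoes the log to get a value on the sales scale
```

Since the response in the formula is `log(sales)`, the value in `out$fit` is on the log scale; `exp()` is one simple way to map it back.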

In the simplest case, with only one independent variable, we would fit a line to dat1 and then read off the Y-value that line predicts at the X-value of dat2.
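The one-variable case can be sketched as follows. The data frame `toy` is made up so that y lies exactly on the line 2 * x + 1, which makes it easy to check that `predict()` does nothing more than evaluate the fitted line at the held-out x.

```r
# Toy data (assumption: invented for illustration), y = 2 * x + 1 exactly
toy <- data.frame(x = 1:6, y = 2 * (1:6) + 1)

train    <- toy[1:(nrow(toy) - 1), ]  # all rows but the last
held_out <- toy[nrow(toy), ]          # only the last row (x = 6)

fit_toy <- lm(y ~ x, data = train)

# predict() reads x from held_out and evaluates the fitted line there
pred <- predict(fit_toy, newdata = held_out)

# The same value computed by hand from the fitted coefficients
by_hand <- coef(fit_toy)[["(Intercept)"]] + coef(fit_toy)[["x"]] * held_out$x
```

Here `pred` and `by_hand` agree, and both equal the true value 2 * 6 + 1 up to floating-point error, because the training points lie exactly on a line.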

Source: https://stackoverflow.com/questions/38036874/predict-and-newdata-how-does-this-work

Tags

regression

predict