linear regression using lm() - surprised by the result

半城伤御伤魂 submitted on 2019-12-01 03:47:16

Try this:

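# reg_lin is the lm() fit from the question (assumed here to be lm(y1 ~ x1))
reg_lin_int <- reg_lin$coefficients[1]  # fitted intercept
reg_lin_slp <- reg_lin$coefficients[2]  # fitted slope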

sum((y1 - (reg_lin_int + reg_lin_slp*x1)) ^ 2)
# [1] 39486.33
sum((y1 - (-150 + 8 * x1)) ^ 2)
# [1] 55583.18

The sum of squared residuals is lower under the lm fit line. This is to be expected, as reg_lin_int and reg_lin_slp are guaranteed to produce the minimal total squared error.
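To make the minimality claim concrete, here is a small sketch on synthetic data. The vectors `x` and `y` are illustrative stand-ins for `x1` and `y1`, which are not shown in this thread. It checks that `lm()` agrees with the closed-form least-squares estimates and that other candidate lines never achieve a smaller sum of squared residuals.

```r
# Synthetic stand-in data (assumption: the real x1/y1 are not available here).
set.seed(1)
x <- runif(50, 0, 10)
y <- 3 + 2 * x + rnorm(50, sd = 2)

fit <- lm(y ~ x)

# Sum of squared residuals for an arbitrary line y = a + b * x.
sse <- function(a, b) sum((y - (a + b * x))^2)

# Closed-form OLS estimates: slope = cov(x, y) / var(x).
slope_cf <- cov(x, y) / var(x)
int_cf   <- mean(y) - slope_cf * mean(x)
print(all.equal(unname(coef(fit)), c(int_cf, slope_cf)))

# Any other line has an SSE at least as large as the fitted one.
sse_fit   <- sse(coef(fit)[1], coef(fit)[2])
sse_true  <- sse(3, 2)                           # the generating line
sse_shift <- sse(int_cf + 0.5, slope_cf - 0.1)   # a nearby perturbed line
print(c(fit = sse_fit, true = sse_true, shifted = sse_shift))
```

Note that even the line used to generate the data does not beat the fitted line on this criterion, which is the same effect as the hand-drawn guess losing to `lm()` above.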

Intuitively, we know that estimators under squared loss functions are sensitive to outliers. The fitted line "misses" the group at the bottom because it is pulled toward the group at the top left, which lies much further away, and squared distance gives those points more weight.
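A rough illustration of that pull, using made-up data rather than the data from the question: a main cloud with a clear upward trend plus a handful of distant points at the top left. All names (`main`, `far`, `both`) are hypothetical.

```r
# Made-up data (assumption): 40 points on an upward trend, 5 far-away points.
set.seed(2)
main <- data.frame(x = runif(40, 10, 20))
main$y <- 5 * main$x + rnorm(40, sd = 3)
far  <- data.frame(x = runif(5, 0, 2), y = rnorm(5, mean = 150, sd = 3))
both <- rbind(main, far)

fit_main <- lm(y ~ x, data = main)
fit_both <- lm(y ~ x, data = both)
slope_main <- unname(coef(fit_main)[2])
slope_both <- unname(coef(fit_both)[2])
print(c(main_only = slope_main, with_far_group = slope_both))

# Under the main-only line, how much of the total squared error
# comes from the 5 distant points (rows 41 to 45 of `both`)?
res_sq    <- (both$y - predict(fit_main, newdata = both))^2
far_share <- sum(res_sq[41:45]) / sum(res_sq)
print(far_share)
```

Because those few points dominate the squared error, the least-squares fit gives up accuracy on the main cloud to get closer to them.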

In fact, if we use Least Absolute Deviations regression (i.e., specify an absolute loss function instead of a squared one), the result is much closer to your guess:

library(quantreg)  # provides rq()
lad_reg <- rq(y1 ~ x1)  # default tau = 0.5 is median (LAD) regression
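For intuition about what the absolute-loss fit is doing, here is a base-R sketch that does not rely on quantreg. It uses the fact that, for a straight-line fit, some optimal LAD line passes through two of the data points, so an exhaustive search over pairs is exact for small samples. The data and the helper `sad()` are illustrative assumptions, not part of the question.

```r
# Illustrative data (assumption): a clean trend plus three gross outliers.
set.seed(3)
x <- runif(30, 0, 10)
y <- 1 + 2 * x + rnorm(30)
y[1:3] <- y[1:3] + 40

# Sum of absolute deviations for the line y = a + b * x.
sad <- function(a, b) sum(abs(y - (a + b * x)))

# Try the line through every pair of points and keep the best one.
idx <- combn(length(x), 2)
lad_coef  <- c(NA_real_, NA_real_)
best_loss <- Inf
for (k in seq_len(ncol(idx))) {
  i <- idx[1, k]
  j <- idx[2, k]
  if (x[i] == x[j]) next               # skip vertical lines
  b <- (y[j] - y[i]) / (x[j] - x[i])
  a <- y[i] - b * x[i]
  loss <- sad(a, b)
  if (loss < best_loss) {
    best_loss <- loss
    lad_coef  <- c(a, b)
  }
}

ols_coef <- unname(coef(lm(y ~ x)))
print(rbind(LAD = lad_coef, OLS = ols_coef))

# Optional cross-check against rq(), only if quantreg is installed.
if (requireNamespace("quantreg", quietly = TRUE)) {
  print(coef(quantreg::rq(y ~ x)))
}
```

The pairwise search is quadratic in the sample size, so it is only meant as an explanation; `rq()` is the practical tool.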

(Pro tip: use lwd to make your graphs much more readable)

What gets even closer to what you had in mind is Total Least Squares, as mentioned by @nongkrong and @MikeWilliamson. Here is the result of TLS on your sample:

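v <- prcomp(cbind(x1, y1))$rotation      # columns are the principal directions
bbeta <- v[2, 1] / v[1, 1]               # TLS slope: direction of the first principal component
inter <- mean(y1) - bbeta * mean(x1)     # the TLS line passes through the centroid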
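To see how TLS differs from OLS, the sketch below compares the two criteria on synthetic data (the vectors `x` and `y` and both helper functions are assumptions for illustration). The TLS slope is taken from the first principal component: that line minimizes the sum of squared perpendicular distances, whereas OLS minimizes the sum of squared vertical distances.

```r
# Synthetic stand-in data (assumption).
set.seed(4)
x <- rnorm(60, mean = 10, sd = 3)
y <- 2 * x + rnorm(60, sd = 4)

# TLS line: direction of the first principal component, through the centroid.
pc1       <- prcomp(cbind(x, y))$rotation[, 1]
tls_slope <- unname(pc1[2] / pc1[1])
tls_int   <- mean(y) - tls_slope * mean(x)

ols <- unname(coef(lm(y ~ x)))

# Squared vertical distances (OLS criterion) and squared
# perpendicular distances (TLS criterion) to the line y = a + b * x.
vert_ss <- function(a, b) sum((y - a - b * x)^2)
orth_ss <- function(a, b) sum((y - a - b * x)^2) / (1 + b^2)

print(rbind(
  OLS = c(slope = ols[2],    vertical = vert_ss(ols[1], ols[2]),
          perpendicular = orth_ss(ols[1], ols[2])),
  TLS = c(slope = tls_slope, vertical = vert_ss(tls_int, tls_slope),
          perpendicular = orth_ss(tls_int, tls_slope))
))
```

Each line wins on its own criterion, and the TLS slope is at least as steep in magnitude as the OLS slope, which is why it tends to look closer to the line one would draw by eye through an elongated cloud.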

You got a nice answer already, but maybe this is also helpful:

As you know, OLS minimizes the sum of squared errors in the y-direction. This implicitly assumes that the uncertainty of your x-values is negligible, which is often the case, but possibly not for your data. If we assume that the uncertainties in x and y are equal and do Deming regression, we get a fit more similar to what you expected.

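library(MethComp)  # Deming() is assumed to come from the MethComp package
dem_reg <- Deming(x1, y1)
# Add the Deming line (intercept, slope) to the existing scatter plot
abline(dem_reg[1:2], col = "green")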
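For readers without that package, here is a minimal base-R sketch of the closed-form Deming slope. The function name `deming_fit` and its `delta` argument are hypothetical; `delta` stands for the assumed ratio of the error variance in y to the error variance in x. With `delta = 1` the result coincides with total least squares, and as `delta` grows (x measured almost exactly) it approaches the OLS slope.

```r
# Hypothetical helper: closed-form Deming regression for one predictor.
# delta = (error variance in y) / (error variance in x), assumed known.
deming_fit <- function(x, y, delta = 1) {
  sxx <- var(x)
  syy <- var(y)
  sxy <- cov(x, y)
  slope <- (syy - delta * sxx +
              sqrt((syy - delta * sxx)^2 + 4 * delta * sxy^2)) / (2 * sxy)
  c(intercept = mean(y) - slope * mean(x), slope = slope)
}

# Synthetic data with measurement error in both variables (assumption).
set.seed(5)
truth <- runif(80, 0, 20)
x <- truth + rnorm(80, sd = 2)
y <- 1 + 1.5 * truth + rnorm(80, sd = 2)

dem_equal <- deming_fit(x, y, delta = 1)     # equal uncertainties
dem_large <- deming_fit(x, y, delta = 1e6)   # x nearly error-free

ols_slope <- unname(coef(lm(y ~ x))[2])
pc1       <- prcomp(cbind(x, y))$rotation[, 1]
tls_slope <- unname(pc1[2] / pc1[1])

print(c(ols = ols_slope, deming_equal = unname(dem_equal["slope"]),
        tls = tls_slope, deming_large_delta = unname(dem_large["slope"])))
```

In this setup OLS underestimates the slope because of the noise in x, while the equal-variance Deming fit steepens the line, matching the behaviour described above.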

You don't provide detailed information about your data, so this may or may not apply.
