Are there any linear regression functions in SQL Server 2005/2008, similar to the linear regression functions in Oracle?
I hope the following answer helps one understand where some of the solutions come from. I am going to illustrate it with a simple example, but the generalization to many variables is theoretically straightforward as long as you know how to use index notation or matrices. For implementing the solution for anything beyond 3 variables you'll need Gram-Schmidt (see Colin Campbell's answer above) or another matrix inversion algorithm.
Since all the functions we need (variance, covariance, average, sum, etc.) are aggregation functions in SQL, the solution is easy to implement. I've done so in HIVE to do linear calibration of the scores of a Logistic model; amongst many advantages, one is that you can work entirely within HIVE without going out to and back in from some scripting language.
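For instance, SQL Server has a built-in VAR() aggregate but no covariance aggregate, so you compose covariance out of plain AVG. A minimal sketch, assuming a placeholder table Points with FLOAT columns x and y (the names are mine, not anything standard):

```sql
-- Population covariance and variance from plain AVG aggregates:
-- cov(a, b) = <a*b> - <a>*<b>, and var(a) = cov(a, a).
SELECT
    AVG(x * y) - AVG(x) * AVG(y) AS cov_xy,
    AVG(x * x) - AVG(x) * AVG(x) AS var_x
FROM Points;  -- Points(x FLOAT, y FLOAT) is an assumed table
```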
The model for your data (x_1, x_2, y), where your data points are indexed by i, is
y(x_1, x_2) = m_1*x_1 + m_2*x_2 + c
The model appears "linear", but needn't be. For example, x_2 can be any non-linear function of x_1, as long as it has no free parameters in it, e.g. x_2 = Sinh(3*(x_1)^2 + 42). Even if x_2 is "just" x_2 and the model is linear, the regression problem isn't automatically linear. Only when you decide that the problem is to find the parameters m_1, m_2, c that minimize the L2 error do you have a Linear Regression problem.
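In SQL this just means deriving the extra regressor as a computed column before fitting. For instance, to fit the quadratic y = m_1*x + m_2*x^2 + c, one could feed the machinery below a view like this (the view name is a placeholder of mine):

```sql
-- x_2 is a fixed, parameter-free function of x_1, so the fit
-- stays linear in m_1, m_2, c even though the curve is not.
CREATE VIEW QuadraticPoints AS
SELECT x     AS x_1,
       x * x AS x_2,  -- any parameter-free transform works, e.g. LOG(x) or EXP(x)
       y
FROM Points;
```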
The L2 error is sum_i( (y[i] - y(x_1[i], x_2[i]))^2 ). Minimizing this w.r.t. the 3 parameters (set the partial derivatives w.r.t. each parameter = 0) yields 3 linear equations for 3 unknowns. These equations are LINEAR in the parameters (this is what makes it Linear Regression) and can be solved analytically. Doing this for a simple model (1 variable, linear model, hence two parameters) is straightforward and instructive. The generalization to a non-Euclidean metric norm on the error vector space is straightforward; the diagonal special case amounts to using "weights".
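To make the simple case concrete, here is the one-variable derivation sketched out (my notation; angle brackets denote the average over data points):

```latex
% Model: y = m*x + c.  L2 error: E = \sum_i (y_i - m x_i - c)^2.
% Setting the two partial derivatives to zero:
\frac{\partial E}{\partial c} = 0 \;\Rightarrow\; c = \langle y\rangle - m\,\langle x\rangle
\frac{\partial E}{\partial m} = 0 \;\Rightarrow\;
m = \frac{\langle xy\rangle - \langle x\rangle\langle y\rangle}
         {\langle x^2\rangle - \langle x\rangle^2}
  = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)}
```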
Back to our model in two variables:
y = m_1*x_1 + m_2*x_2 + c
Take the expectation value of both sides (writing <.> for the average over the data points):

<y> = m_1*<x_1> + m_2*<x_2> + c    (0)
Now take the covariance w.r.t. x_1 and x_2, and use cov(x,x) = var(x):
cov(y, x_1) = m_1*var(x_1) + m_2*cov(x_2, x_1)    (1)
cov(y, x_2) = m_1*cov(x_1, x_2) + m_2*var(x_2)    (2)
These are two equations in two unknowns, which you can solve by inverting the 2x2 matrix.

In matrix form:

[ cov(y, x_1) ]   [ var(x_1)       cov(x_1, x_2) ] [ m_1 ]
[ cov(y, x_2) ] = [ cov(x_1, x_2)  var(x_2)      ] [ m_2 ]

which can be inverted to yield

m_1 = ( var(x_2)*cov(y, x_1) - cov(x_1, x_2)*cov(y, x_2) ) / det
m_2 = ( var(x_1)*cov(y, x_2) - cov(x_1, x_2)*cov(y, x_1) ) / det

where

det = var(x_1)*var(x_2) - cov(x_1, x_2)^2
Now that you have m_1 and m_2 in closed form, you can solve (0) for c.
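Putting the closed form into a single query: a sketch in T-SQL (it only needs CTEs and AVG, so it runs on SQL Server 2005+), again assuming a placeholder table Points(x_1, x_2, y) with FLOAT columns:

```sql
WITH raw AS (  -- raw moments, all plain AVG aggregates
    SELECT
        AVG(x_1)       AS ax1,  AVG(x_2)       AS ax2,  AVG(y)         AS ay,
        AVG(x_1 * x_1) AS ax11, AVG(x_1 * x_2) AS ax12, AVG(x_2 * x_2) AS ax22,
        AVG(x_1 * y)   AS ax1y, AVG(x_2 * y)   AS ax2y
    FROM Points
),
cm AS (  -- central moments: cov(a, b) = <a*b> - <a>*<b>, var(a) = cov(a, a)
    SELECT
        ax11 - ax1 * ax1 AS var1,
        ax22 - ax2 * ax2 AS var2,
        ax12 - ax1 * ax2 AS cov12,
        ax1y - ax1 * ay  AS cov1y,
        ax2y - ax2 * ay  AS cov2y,
        ax1, ax2, ay
    FROM raw
),
fit AS (  -- invert the 2x2 system, i.e. equations (1) and (2)
    SELECT
        (var2 * cov1y - cov12 * cov2y) / (var1 * var2 - cov12 * cov12) AS m_1,
        (var1 * cov2y - cov12 * cov1y) / (var1 * var2 - cov12 * cov12) AS m_2,
        ax1, ax2, ay
    FROM cm
)
SELECT m_1, m_2,
       ay - m_1 * ax1 - m_2 * ax2 AS c  -- c from equation (0)
FROM fit;
```

Point the same query at the QuadraticPoints view from earlier instead of Points and it fits the quadratic, which is exactly the check described below.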
I checked the analytical solution above against Excel's Solver for a quadratic with Gaussian noise, and the residual errors agree to 6 significant digits.
Contact me if you want to do a Discrete Fourier Transform in SQL in about 20 lines.