How does sklearn do linear regression when p > n?

遇见更好的自我 2020-12-09 23:41

It's known that when the number of variables (p) is larger than the number of samples (n), the least squares estimator is not defined.

In sklearn, however, LinearRegression still returns coefficients. How are they computed?

1 Answer
  • 2020-12-10 00:12

    When the linear system is underdetermined, sklearn.linear_model.LinearRegression finds the minimum-L2-norm solution, i.e.

    argmin_w ||w||_2  subject to  Xw = y
    

    This is always well defined and obtainable by applying the pseudoinverse of X to y, i.e.

    w = np.linalg.pinv(X).dot(y)
    

    The specific implementation of scipy.linalg.lstsq, which LinearRegression uses, calls get_lapack_funcs(('gelss',), ..., which is precisely a solver that finds the minimum-norm solution via the singular value decomposition (provided by LAPACK).
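    To make the SVD route concrete, here is a minimal sketch of what such a solver computes, using numpy's SVD directly rather than the LAPACK driver itself (the variable names are illustrative, not from scipy's source):

```python
import numpy as np
from scipy.linalg import lstsq

rng = np.random.RandomState(0)
X = rng.randn(5, 10)   # underdetermined: n=5 samples, p=10 features
y = rng.randn(5)

# Thin SVD: X = U @ diag(s) @ Vt, with U (5, 5), s (5,), Vt (5, 10)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Minimum-norm solution: w = V @ diag(1/s) @ U.T @ y
w_svd = Vt.T @ ((U.T @ y) / s)

# scipy's least-squares solver returns the same minimum-norm solution
w_lstsq, *_ = lstsq(X, y)
print(np.allclose(w_svd, w_lstsq))  # True
```

    Inverting only the nonzero singular values is exactly what the pseudoinverse does, which is why the two routes below agree.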

    Check out this example

    import numpy as np
    rng = np.random.RandomState(42)
    X = rng.randn(5, 10)  # n=5 samples, p=10 features: underdetermined
    y = rng.randn(5)
    
    from sklearn.linear_model import LinearRegression
    lr = LinearRegression(fit_intercept=False)
    coef1 = lr.fit(X, y).coef_        # sklearn's solution
    coef2 = np.linalg.pinv(X).dot(y)  # minimum-norm solution via pseudoinverse
    
    print(coef1)
    print(coef2)
    

    And you will see that coef1 and coef2 agree (up to floating-point precision). Note that fit_intercept=False is specified in the constructor of the sklearn estimator, because otherwise it would subtract the mean of each feature before fitting the model, yielding different coefficients.
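    One can also check the "minimum norm" claim directly: any other exact solution differs from w by a null-space vector of X and must have a strictly larger L2 norm. A small sketch (the null-space construction here is my own, not part of sklearn):

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.randn(5, 10)
y = rng.randn(5)

w = np.linalg.pinv(X).dot(y)  # minimum-norm solution

# The trailing right singular vectors span the null space of X.
_, _, Vt = np.linalg.svd(X)   # Vt is (10, 10); rows 5..9 span the null space
z = Vt[5:].T @ rng.randn(5)   # a random null-space direction, so X @ z ≈ 0
w_other = w + z               # still solves X @ w_other == y

print(np.allclose(X @ w_other, y))                  # True: both fit exactly
print(np.linalg.norm(w), np.linalg.norm(w_other))   # w has the smaller norm
```

    Since w lies in the row space of X, it is orthogonal to z, so ||w_other||^2 = ||w||^2 + ||z||^2 > ||w||^2.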
