Different coefficients: scikit-learn vs statsmodels (logistic regression)

坚强是说给别人听的谎言, submitted on 2019-12-10 17:47:14

Question


When I run a logistic regression, the coefficients I get from statsmodels are correct (I verified them against some course material). However, I cannot reproduce the same coefficients with sklearn, and preprocessing the data has not helped. This is my code:

Statsmodels:

import statsmodels.api as sm

# statsmodels does not fit an intercept unless a constant column is added
X_const = sm.add_constant(X)
model = sm.Logit(y, X_const)
results = model.fit()
print(results.summary())

The relevant output is:

                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const      -0.2382      3.983     -0.060      0.952      -8.045       7.569
a           2.0349      0.837      2.430      0.015       0.393       3.676
b           0.8077      0.823      0.981      0.327      -0.806       2.421
c           1.4572      0.768      1.897      0.058      -0.049       2.963
d          -0.0522      0.063     -0.828      0.407      -0.176       0.071
e_2         0.9157      1.082      0.846      0.397      -1.205       3.037
e_3         2.0080      1.052      1.909      0.056      -0.054       4.070

Scikit-learn (no preprocessing):

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
results = model.fit(X, y)
print(results.coef_)
print(results.intercept_)

The coefficients given are:

array([[ 1.29779008,  0.56524976,  0.97268593, -0.03762884,  0.33646097,
     0.98020901]])

And the intercept/constant given is:

array([ 0.0949539])

As you can see, regardless of which coefficient corresponds to which variable, the numbers given by sklearn don't match the correct ones from statsmodels. What am I missing? Thanks in advance!


Answer 1:


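Thanks to a kind soul on reddit, this was solved. sklearn applies L2 regularisation to logistic regression by default, whereas statsmodels fits a plain maximum-likelihood model with no penalty. To get the same coefficients, make sklearn's regularisation negligible by setting C to a very large value: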

model = LogisticRegression(C=1e8)

where C, according to the documentation, is:

C : float, default: 1.0

Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
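To make the effect of C concrete, here is a minimal sketch on synthetic data (not the asker's dataset; the variable names and the data-generating coefficients are made up for illustration). It fits the same model with the default C=1.0 and with C=1e8 and compares the size of the coefficient vectors. The penalised fit should come out with the smaller norm, which is the same kind of shrinkage visible in the question's sklearn output.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, noisy binary data (hypothetical; stands in for the asker's X and y).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_coef = np.array([1.5, -1.0, 0.5])
prob = 1.0 / (1.0 + np.exp(-(X @ true_coef)))
y = (rng.random(100) < prob).astype(int)

# Default setting: C=1.0, which applies a noticeable L2 penalty.
regularized = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)

# Very large C: the penalty term becomes negligible.
unregularized = LogisticRegression(C=1e8, max_iter=1000).fit(X, y)

norm_reg = np.linalg.norm(regularized.coef_)
norm_unreg = np.linalg.norm(unregularized.coef_)

print("C=1.0 coefficients:", regularized.coef_.ravel(), "norm:", norm_reg)
print("C=1e8 coefficients:", unregularized.coef_.ravel(), "norm:", norm_unreg)
```

The smaller the dataset, the more the default penalty matters, because the penalty is fixed while the log-likelihood term grows with the number of samples.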




Answer 2:


I'm not familiar with statsmodels, but could it be that its .fit() method uses different default arguments from sklearn's? To verify this, you could explicitly set the corresponding arguments to the same values in each call and see whether you still get different results.



Source: https://stackoverflow.com/questions/50428825/different-coefficients-scikit-learn-vs-statsmodels-logistic-regression
