Different coefficients: scikit-learn vs statsmodels (logistic regression)

Question

When running a logistic regression, the coefficients I get from statsmodels are correct (I verified them against some course material). However, I am unable to get the same coefficients with sklearn. I've tried preprocessing the data, to no avail. This is my code:

Statsmodels:

import statsmodels.api as sm

# statsmodels does not add an intercept by default, so add a constant column
X_const = sm.add_constant(X)
model = sm.Logit(y, X_const)
results = model.fit()
print(results.summary())

The relevant output is:

                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const      -0.2382      3.983     -0.060      0.952      -8.045       7.569
a           2.0349      0.837      2.430      0.015       0.393       3.676
b           0.8077      0.823      0.981      0.327      -0.806       2.421
c           1.4572      0.768      1.897      0.058      -0.049       2.963
d          -0.0522      0.063     -0.828      0.407      -0.176       0.071
e_2         0.9157      1.082      0.846      0.397      -1.205       3.037
e_3         2.0080      1.052      1.909      0.056      -0.054       4.070

Scikit-learn (no preprocessing):

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
results = model.fit(X, y)
print(results.coef_)
print(results.intercept_)

The coefficients given are:

array([[ 1.29779008,  0.56524976,  0.97268593, -0.03762884,  0.33646097,
     0.98020901]])

And the intercept/constant given is:

array([ 0.0949539])

As you can see, regardless of which coefficient corresponds to which variable, the numbers given by sklearn don't match the correct ones from statsmodels. What am I missing? Thanks in advance!

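Answer 1:

Thanks to a kind soul on reddit, this was solved. sklearn applies L2 regularisation to logistic regression by default, while statsmodels fits an unpenalised model. To get the same coefficients, make sklearn's regularisation negligible by setting C to a very large value: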

# A very large C makes the default L2 penalty negligible
model = LogisticRegression(C=1e8)

where C, according to the documentation, is:

C : float, default: 1.0

Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.

Answer 2:

I'm not familiar with statsmodels, but could it be that the .fit() method of this library uses different default arguments compared to sklearn's? To verify this, you could explicitly set the corresponding arguments to the same values in each library and see whether you still get different results.
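
One quick way to follow this suggestion on the sklearn side is to print the estimator's defaults. Note that sklearn takes these settings in the constructor rather than in .fit(). A small sketch, using `get_params()`, which every sklearn estimator provides (the variable name `defaults` is just for illustration):

```python
from sklearn.linear_model import LogisticRegression

# List the constructor defaults of an untouched estimator
defaults = LogisticRegression().get_params()
for name in sorted(defaults):
    print(f"{name} = {defaults[name]!r}")

# The setting that matters here: C is the inverse regularization strength
print("Default C:", defaults["C"])
```

The exact set of parameters printed depends on the installed sklearn version, so treat the listing as something to compare against the statsmodels documentation, not as a fixed reference.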

Source: https://stackoverflow.com/questions/50428825/different-coefficients-scikit-learn-vs-statsmodels-logistic-regression

Tags

python

scikit-learn

logistic-regression

statsmodels