Question
As a check on my work, I've been comparing the output of scikit-learn's SGDClassifier logistic implementation with statsmodels' logistic regression. Once I add some L1 regularization in combination with categorical variables, I get very different results. Is this a result of different solution techniques, or am I not using the correct parameter?
The differences are much bigger on my own dataset, but they are still fairly large with mtcars:
import patsy
import statsmodels.api as sm
from sklearn.linear_model import SGDClassifier

df = sm.datasets.get_rdataset("mtcars", "datasets").data
y, X = patsy.dmatrices('am~standardize(wt) + standardize(disp) + C(cyl) - 1', df)
# L1-penalized logit via statsmodels vs. L1-penalized logistic SGD in scikit-learn
logit = sm.Logit(y, X).fit_regularized(alpha=.0035)
clf = SGDClassifier(alpha=.0035, penalty='l1', loss='log', l1_ratio=1,
                    n_iter=1000, fit_intercept=False)
clf.fit(X, y)
gives:
sklearn: [-3.79663192 -1.16145654 0.95744308 -5.90284803 -0.67666106]
statsmodels: [-7.28440744 -2.53098894 3.33574042 -7.50604097 -3.15087396]
Answer 1:
I've been working through some similar issues. I think the short answer might be that SGD doesn't work so well with only a few samples, but it is (much more) performant with larger data. I'd be interested in hearing from the sklearn devs. Compare, for example, the result of using LogisticRegression:
from sklearn.linear_model import LogisticRegression

# L1-penalized logistic regression on the same design matrix
clf2 = LogisticRegression(penalty='l1', C=1/.0035, fit_intercept=False)
clf2.fit(X, y)
This gives coefficients very similar to the L1-penalized Logit:
array([[-7.27275526, -2.52638167, 3.32801895, -7.50119041, -3.14198402]])
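One side note, not from the original answer, that may also matter when comparing the two scikit-learn estimators: their regularization strengths live on different scales. SGDClassifier adds alpha * ||w||_1 to the loss averaged over samples, while LogisticRegression multiplies the summed loss by C, so the two objectives coincide only when alpha = 1 / (C * n_samples); whether SGD then actually reaches that optimum on 32 rows is a separate question. Below is a minimal sketch of that comparison; the parameter names assume a recent scikit-learn (loss='log_loss' and max_iter replace the older loss='log' and n_iter), and the values are illustrative.

import numpy as np
import patsy
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression, SGDClassifier

df = sm.datasets.get_rdataset("mtcars", "datasets").data
y, X = patsy.dmatrices('am~standardize(wt) + standardize(disp) + C(cyl) - 1', df)
y = np.ravel(y)            # sklearn expects a 1-D target
n_samples = X.shape[0]

C = 1 / .0035
# alpha that puts SGDClassifier's penalty on the same scale as this C
alpha_equiv = 1.0 / (C * n_samples)

lr = LogisticRegression(penalty='l1', C=C, solver='liblinear',
                        fit_intercept=False).fit(X, y)
sgd = SGDClassifier(loss='log_loss', penalty='l1', alpha=alpha_equiv,
                    max_iter=10000, tol=None, fit_intercept=False).fit(X, y)

print("LogisticRegression:", lr.coef_.ravel())
print("SGDClassifier     :", sgd.coef_.ravel())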
Source: https://stackoverflow.com/questions/26246127/difference-in-sgd-classifier-results-and-statsmodels-results-for-logistic-with-l