Question
As a check on my work, I've been comparing the output of scikit-learn's SGDClassifier logistic implementation with statsmodels' logistic regression. Once I add some L1 regularization in combination with categorical variables, I get very different results. Is this a result of different solution techniques, or am I not using the correct parameter?
The differences are much bigger on my own dataset, but they are still fairly large with mtcars:
import patsy
import statsmodels.api as sm
from sklearn.linear_model import SGDClassifier

df = sm.datasets.get_rdataset("mtcars", "datasets").data
y, X = patsy.dmatrices('am~standardize(wt) + standardize(disp) + C(cyl) - 1', df)
# L1-penalized logit via statsmodels vs. L1-penalized logistic SGD in scikit-learn
logit = sm.Logit(y, X).fit_regularized(alpha=.0035)
clf = SGDClassifier(alpha=.0035, penalty='l1', loss='log', l1_ratio=1,
                    n_iter=1000, fit_intercept=False)
clf.fit(X, y)
gives:
sklearn: [-3.79663192 -1.16145654 0.95744308 -5.90284803 -0.67666106]
statsmodels: [-7.28440744 -2.53098894 3.33574042 -7.50604097 -3.15087396]
Answer 1:
I've been working through some similar issues. I think the short answer might be that SGD doesn't work so well with only a few samples, but it is (much more) performant with larger data. I'd be interested in hearing from the sklearn devs. Compare, for example, the result of using LogisticRegression:
from sklearn.linear_model import LogisticRegression

# L1-penalized logistic regression on the same design matrix
clf2 = LogisticRegression(penalty='l1', C=1/.0035, fit_intercept=False)
clf2.fit(X, y)
This gives coefficients very similar to the L1-penalized Logit:
array([[-7.27275526, -2.52638167, 3.32801895, -7.50119041, -3.14198402]])
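One side note, not from the original answer, that may also matter when comparing the two scikit-learn estimators: their regularization strengths live on different scales. SGDClassifier adds alpha * ||w||_1 to the loss averaged over samples, while LogisticRegression multiplies the summed loss by C, so the two objectives coincide only when alpha = 1 / (C * n_samples); whether SGD then actually reaches that optimum on 32 rows is a separate question. Below is a minimal sketch of that comparison; the parameter names assume a recent scikit-learn (loss='log_loss' and max_iter replace the older loss='log' and n_iter), and the values are illustrative.

import numpy as np
import patsy
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression, SGDClassifier

df = sm.datasets.get_rdataset("mtcars", "datasets").data
y, X = patsy.dmatrices('am~standardize(wt) + standardize(disp) + C(cyl) - 1', df)
y = np.ravel(y)            # sklearn expects a 1-D target
n_samples = X.shape[0]

C = 1 / .0035
# alpha that puts SGDClassifier's penalty on the same scale as this C
alpha_equiv = 1.0 / (C * n_samples)

lr = LogisticRegression(penalty='l1', C=C, solver='liblinear',
                        fit_intercept=False).fit(X, y)
sgd = SGDClassifier(loss='log_loss', penalty='l1', alpha=alpha_equiv,
                    max_iter=10000, tol=None, fit_intercept=False).fit(X, y)

print("LogisticRegression:", lr.coef_.ravel())
print("SGDClassifier     :", sgd.coef_.ravel())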
Source: https://stackoverflow.com/questions/26246127/difference-in-sgd-classifier-results-and-statsmodels-results-for-logistic-with-l