I have run some experiments with logistic regression in R, Python's statsmodels, and sklearn. While the results given by R and statsmodels agree, there is some discrepancy with what is returned by sklearn, and I would like to understand why. I understand that the optimization algorithms used under the hood are probably not the same.
Specifically, I use the standard `Default` dataset (used in the ISL book). The following Python code reads the data into a dataframe `Default`.
~~~python
import pandas as pd

# the data is available here:
Default = pd.read_csv('https://d1pqsl2386xqi9.cloudfront.net/notebooks/Default.csv', index_col=0)

# recode the two binary variables as 0/1
Default['default'] = Default['default'].map({'No': 0, 'Yes': 1})
Default['student'] = Default['student'].map({'No': 0, 'Yes': 1})

# count the positive ('default') examples
I = Default['default'] == 0
print("Number of 'default' values :", Default[~I]['balance'].count())
~~~

~~~
Number of 'default' values : 333
~~~
There are 10000 examples in total, of which only 333 are positive.
Logistic regression in R
I use the following code:

~~~r
library("ISLR")
data(Default, package='ISLR')
#write.csv(Default,"default.csv")
glm.out = glm('default~balance+income+student', family=binomial, data=Default)
s = summary(glm.out)
print(s)
#
glm.probs = predict(glm.out, type="response")
glm.probs[1:5]
glm.pred = ifelse(glm.probs > 0.5, "Yes", "No")
#attach(Default)
t = table(glm.pred, Default$default)
print(t)
score = mean(glm.pred == Default$default)
print(paste("score", score))
~~~
The result is as follows:

~~~
Call:
glm(formula = "default~balance+income+student", family = binomial,
    data = Default)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.4691  -0.1418  -0.0557  -0.0203   3.7383

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16
balance      5.737e-03  2.319e-04  24.738  < 2e-16
income       3.033e-06  8.203e-06   0.370  0.71152
studentYes  -6.468e-01  2.363e-01  -2.738  0.00619

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1571.5  on 9996  degrees of freedom
AIC: 1579.5

Number of Fisher Scoring iterations: 8

glm.pred   No  Yes
     No  9627  228
     Yes   40  105

[1] "score 0.9732"
~~~
I am too lazy to cut and paste the results obtained with statsmodels. It suffices to say that they are extremely similar to those given by R.
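For completeness, here is a minimal sketch of the statsmodels fit, assuming the `Default` dataframe from the Python snippet above (with `default` and `student` already recoded to 0/1):

~~~python
import statsmodels.formula.api as smf

# same model as the R call: default ~ balance + income + student
logit_model = smf.logit('default ~ balance + income + student', data=Default)
result = logit_model.fit()   # Newton-type optimizer, comparable to R's IRLS
print(result.summary())      # coefficients and standard errors close to R's
~~~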
sklearn
For sklearn, I ran the following code.
- There is a parameter `class_weight` for taking unbalanced classes into account. I tested `class_weight=None` (no weighting, which I believe corresponds to what R does) and `class_weight='auto'` (weighting by the inverse of the class frequencies found in the data); a quick sketch of what this weighting computes appears just after this list.
- I also set `C=10000`, the inverse of the regularization strength, so as to minimize the effect of regularization.
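To make the `'auto'` option concrete, here is a quick sketch of the per-class weights I believe it assigns on this data; I am assuming it follows the inverse-frequency formula that the sklearn docs give for `'balanced'`, namely `n_samples / (n_classes * np.bincount(y))`:

~~~python
import numpy as np

# assumption: class_weight='auto' weights classes by inverse frequency,
# like the 'balanced' heuristic n_samples / (n_classes * np.bincount(y))
y = Default['default'].values
counts = np.bincount(y)              # [9667, 333]
weights = len(y) / (2.0 * counts)    # two classes
print(dict(zip([0, 1], weights)))    # roughly {0: 0.52, 1: 15.0}
~~~

The main run is as follows.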
~~~python
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

features = Default[['balance', 'income']]
target = Default['default']

for weight in (None, 'auto'):
    print("*"*40 + "\nweight:", weight)

    # C=10000 ~ (almost) no regularization
    classifier = LogisticRegression(C=10000, class_weight=weight, random_state=42)
    classifier.fit(features, target)   # fit the classifier on the whole dataset
    print("Intercept", classifier.intercept_)
    print("Coefficients", classifier.coef_)

    y_true = target
    y_pred_cls = classifier.predict_proba(features)[:, 1] > 0.5

    C = confusion_matrix(y_true, y_pred_cls)
    score = (C[0,0] + C[1,1]) / (C[0,0] + C[1,1] + C[0,1] + C[1,0])
    precision = C[1,1] / (C[1,1] + C[0,1])
    recall = C[1,1] / (C[1,1] + C[1,0])
    print("\n Confusion matrix")
    print(C)
    print()
    print('{s:{c}<{n}}{num:2.4}'.format(s='Score', n=15, c='', num=score))
    print('{s:{c}<{n}}{num:2.4}'.format(s='Precision', n=15, c='', num=precision))
    print('{s:{c}<{n}}{num:2.4}'.format(s='Recall', n=15, c='', num=recall))
~~~
The results are given below.
~~~
****************************************
weight: None
Intercept [ -1.94164126e-06]
Coefficients [[ 0.00040756 -0.00012588]]

 Confusion matrix
[[9664    3]
 [ 333    0]]

Score          0.9664
Precision      0.0
Recall         0.0
****************************************
weight: auto
Intercept [-8.15376429]
Coefficients [[  5.67564834e-03   1.95253338e-05]]

 Confusion matrix
[[8356 1311]
 [  34  299]]

Score          0.8655
Precision      0.1857
Recall         0.8979
~~~
What I observe is that for `class_weight=None`, the score is excellent but no positive example is recognized: precision and recall are both zero. The coefficients found are very small, particularly the intercept, and modifying `C` does not change things. For `class_weight='auto'`, things seem better, but precision is still very low (too many examples are classified as positive). Again, changing `C` does not help. If I modify the intercept by hand (sketched below), I can recover the results given by R. So I suspect that there is a discrepancy between the estimation of the intercepts in the two cases. Since this affects the specification of the decision threshold (analogous to a resampling of the populations), it could explain the differences in performance.
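For reference, this is roughly what I mean by modifying the intercept by hand; a sketch only, reusing the classifier from the last loop iteration (`class_weight='auto'`) and plugging in the intercept reported by R (-1.087e+01):

~~~python
import numpy as np

# sketch: overwrite the fitted intercept with R's estimate and re-apply
# the 0.5 threshold; this shifts the decision boundary back towards R's
classifier.intercept_ = np.array([-10.87])   # value taken from the R summary above
y_pred_cls = classifier.predict_proba(features)[:, 1] > 0.5
print(confusion_matrix(target, y_pred_cls))
~~~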
However, I would welcome any advice on how to choose between the two solutions, as well as help in understanding the origin of these differences. Thanks.