Comparing the GLMNET output of R with Python using LogisticRegression()

Submitted by 廉价感情 on 2021-01-29 20:46:50

Question


I am fitting a logistic regression with the L1 penalty (LASSO).

I opted to use the glmnet package in R and LogisticRegression() from sklearn.linear_model in Python. From my understanding these should give the same results, but they do not.

Note that I did not scale my data.

For Python I used the link below as a reference:

https://chrisalbon.com/machine_learning/logistic_regression/logistic_regression_with_l1_regularization/

and for R I used the link below:

http://www.sthda.com/english/articles/36-classification-methods-essentials/149-penalized-logistic-regression-essentials-in-r-ridge-lasso-and-elastic-net/?fbclid=IwAR0ZTjoGqRgH5vNum9CloeGVaHdwlqDHwDdoGKJXwncOgIT98qUXGcnV70k

Here is the code used in R:

###################################
#### LASSO LOGISTIC REGRESSION ####
###################################
library(glmnet)

x <- model.matrix(Y ~ ., Train.Data.SubPop)[, -1]
y <- Train.Data.SubPop$Y
lambda_seq <- c(0.0001, 0.01, 0.05, 0.0025)

cv_output <- cv.glmnet(x,y,alpha=1, family = "binomial", lambda = lambda_seq)

cv_output$lambda.min

lasso_best <- glmnet(x,y, alpha = 1, family = "binomial", lambda = cv_output$lambda.min)

Below is my Python code:

from sklearn.linear_model import LogisticRegression

C = [0.001, 0.01, 0.05, 0.0025]

for c in C:
    clf = LogisticRegression(penalty='l1', C=c, solver='liblinear')
    clf.fit(X_train, y_train)
    print('C:', c)
    print('Coefficient of each feature:', clf.coef_)
    print('Training accuracy:', clf.score(X_train, y_train))
    print('Test accuracy:', clf.score(X_test, y_test))
    print('')

When I extracted the optimal value from the cv.glmnet() function in R, it told me the optimal lambda is 0.0001; however, in the Python analysis the best accuracy/precision/recall came from C = 0.05.

I tried fitting the model with 0.05 in R and it gave me only one non-zero coefficient, but in Python I had seven.

Can someone help me understand these discrepancies and differences, please?

Also, if someone can guide me on how to replicate the Python code in R, it would be very helpful!


Answer 1:


At a glance I see several issues:

  1. Typo: Looking at your code, in R your first lambda is 0.0001, while in Python your first C is 0.001.

  2. Different parameterization: Looking at the documentation, I think there's a clue in the names lambda in R and C in Python being different. In glmnet, higher lambda means more shrinkage. However, in the sklearn docs, C is described as "the inverse of regularization strength... smaller values specify stronger regularization".

  3. Scaling: you say, "Note that I did not scale my data." This is incorrect. In R, you did: glmnet has a standardize argument for scaling the data, and its default is TRUE. In Python, you didn't.

  4. Use of cross-validation. In R, you use cv.glmnet to do k-fold cross-validation on your training set. In Python, you use LogisticRegression, not LogisticRegressionCV, so there is no cross-validation. Note that cross-validation relies on random sampling, so if you do use CV in both, you should expect the results to be close, but not exact matches.
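To make point 2 concrete, one common (approximate) mapping between the two parameterizations is C ≈ 1 / (n_samples * lambda), since glmnet averages the loss over observations while sklearn sums it and scales by C. This is a sketch on synthetic data (make_classification is just placeholder data, not your dataset), and the objectives still differ in other details, e.g. liblinear also penalizes the intercept:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Placeholder data standing in for your training set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

lam = 0.05                    # a glmnet-style lambda
C = 1.0 / (len(X) * lam)      # rough sklearn equivalent: C ~ 1 / (n * lambda)

clf = LogisticRegression(penalty='l1', C=C, solver='liblinear')
clf.fit(X, y)
print('C used:', C)
print('non-zero coefficients:', (clf.coef_ != 0).sum())
```

So a lambda of 0.05 on 200 rows corresponds roughly to C = 0.1, not C = 0.05, which is one reason the coefficient counts you saw don't line up.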
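For point 3, you can mimic glmnet's standardize=TRUE on the Python side by standardizing the features before fitting. A minimal sketch, again on placeholder data:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Placeholder data standing in for your training set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Standardize features, then fit L1 logistic regression on the scaled data.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty='l1', C=0.05, solver='liblinear'),
)
pipe.fit(X, y)
print('training accuracy:', pipe.score(X, y))
```

One caveat: glmnet standardizes internally but reports coefficients back on the original scale, whereas this pipeline's coefficients are on the standardized scale, so compare coefficients with that in mind.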
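And for point 4, the closer analogue of cv.glmnet on the sklearn side is LogisticRegressionCV, which cross-validates over a grid of C values. A sketch with your C grid (on placeholder data; cv.glmnet also defaults to 10 folds, though the fold assignments will differ):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Placeholder data standing in for your training set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 10-fold CV over the candidate C values, selecting the best one.
cv_clf = LogisticRegressionCV(Cs=[0.001, 0.01, 0.05, 0.0025],
                              penalty='l1', solver='liblinear', cv=10)
cv_clf.fit(X, y)
print('selected C:', cv_clf.C_)
```

Even then, expect the selected penalty to differ a bit from cv.glmnet's lambda.min, because the folds are drawn randomly in each library.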

There are possibly other issues too.



Source: https://stackoverflow.com/questions/57855392/comparing-the-glmnet-output-of-r-with-python-using-logisticregression
