Question
I am using logistic regression with the L1 penalty (LASSO).
I have opted to use the glmnet package in R and LogisticRegression() from sklearn.linear_model in Python. From my understanding these should give the same results; however, they do not.
Note that I did not scale my data.
For Python I used the link below as a reference:
https://chrisalbon.com/machine_learning/logistic_regression/logistic_regression_with_l1_regularization/
and for R I used the link below:
http://www.sthda.com/english/articles/36-classification-methods-essentials/149-penalized-logistic-regression-essentials-in-r-ridge-lasso-and-elastic-net/?fbclid=IwAR0ZTjoGqRgH5vNum9CloeGVaHdwlqDHwDdoGKJXwncOgIT98qUXGcnV70k
Here is the code I used in R:
###################################
#### LASSO LOGISTIC REGRESSION ####
###################################
library(glmnet)

x <- model.matrix(Y ~ ., Train.Data.SubPop)[, -1]  # predictor matrix, intercept column dropped
y <- Train.Data.SubPop$Y
lambda_seq <- c(0.0001, 0.01, 0.05, 0.0025)
cv_output <- cv.glmnet(x, y, alpha = 1, family = "binomial", lambda = lambda_seq)
cv_output$lambda.min
lasso_best <- glmnet(x, y, alpha = 1, family = "binomial", lambda = cv_output$lambda.min)
Below is my Python code:
from sklearn.linear_model import LogisticRegression

C = [0.001, 0.01, 0.05, 0.0025]
for c in C:
    clf = LogisticRegression(penalty='l1', C=c, solver='liblinear')
    clf.fit(X_train, y_train)
    print('C:', c)
    print('Coefficient of each feature:', clf.coef_)
    print('Training accuracy:', clf.score(X_train_std, y_train))
    print('Test accuracy:', clf.score(X_test_std, y_test))
    print('')
When I extracted the optimal value from the cv.glmnet() function in R, it told me the optimal lambda is 0.0001; however, the analysis in Python gave the best accuracy/precision/recall at C = 0.05.
I tried fitting the model with 0.05 in R and it gave me only 1 non-zero coefficient, but in Python I had 7.
Can someone help me understand these discrepancies and differences, please?
Also, if someone can guide me on how to replicate the Python code in R, it would be very helpful!
Answer 1:
At a glance I see several issues:

1. Typo: looking at your code, in R your first lambda is 0.0001, while in Python your first C is 0.001.
2. Different parameterization: the different names, lambda in R and C in Python, are a clue. In glmnet, a higher lambda means more shrinkage. In the sklearn docs, however, C is described as "the inverse of regularization strength... smaller values specify stronger regularization".
3. Scaling: you say, "Note that I did not scale my data." This is incorrect. In R, you did: glmnet has a standardize argument for scaling the data, and its default is TRUE. In Python, you didn't.
4. Use of cross-validation: in R, you use cv.glmnet to do k-fold cross-validation on your training set. In Python, you use LogisticRegression, not LogisticRegressionCV, so there is no cross-validation. Note that cross-validation relies on random sampling, so even if you do use CV in both, you should expect the results to be close, but not exact matches.
There are possibly other issues too.
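To make the parameterization and scaling points above concrete, here is a minimal Python sketch of how one might bring sklearn closer to glmnet. It assumes glmnet minimizes (1/n)·deviance + lambda·||beta||_1 while sklearn's L1 logistic regression minimizes C·(summed log-loss) + ||beta||_1, giving the rough mapping C ≈ 1/(n·lambda), and it standardizes features to mimic glmnet's standardize=TRUE default. The data here is synthetic (make_classification) standing in for the asker's Train.Data.SubPop; treat the mapping as an approximation, not an exact equivalence.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the asker's data: 200 rows, 7 features.
X, y = make_classification(n_samples=200, n_features=7, random_state=0)

# glmnet standardizes predictors by default (standardize=TRUE);
# replicate that explicitly on the Python side.
X_std = StandardScaler().fit_transform(X)

lam = 0.05                    # a glmnet-style lambda
C = 1.0 / (len(y) * lam)      # approximate sklearn equivalent of that lambda

clf = LogisticRegression(penalty='l1', C=C, solver='liblinear')
clf.fit(X_std, y)

print('lambda:', lam, '-> C:', C)
print('non-zero coefficients:', np.count_nonzero(clf.coef_))
```

With this mapping, a larger glmnet lambda corresponds to a smaller sklearn C, so sweeping the same grid of raw values in both libraries (as in the question) compares very different regularization strengths.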
Source: https://stackoverflow.com/questions/57855392/comparing-the-glmnet-output-of-r-with-python-using-logisticregression