I tried to fit a logistic regression (LR) with scikit-learn on a rather large dataset (~300K rows) with ~600 dummy variables and only a few interval variables, and the resulting confusion matrix looks suspicious. I wanted to check the significance of the returned coefficients and run an ANOVA, but I cannot find how to access them. Is it possible at all? And what is the best strategy for data that contains lots of dummy variables? Thanks a lot!
Scikit-learn deliberately does not support statistical inference. If you want out-of-the-box coefficient significance tests (and much more), you can use the Logit estimator from Statsmodels. This package mimics the interface of glm models in R, so you may find it familiar.
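If you still want to stick to scikit-learn's LogisticRegression, you can use the asymptotic approximation to the distribution of maximum likelihood estimates. Precisely, for a vector of maximum likelihood estimates theta, its variance-covariance matrix can be estimated as inv(-H), where H is the Hessian matrix of the log-likelihood at theta. For logistic regression, -H = X' W X, where X is the design matrix (including the intercept column) and W is the diagonal matrix of p * (1 - p) for each row. This is exactly what the function below does: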
    import numpy as np
    from scipy.stats import norm
    from sklearn.linear_model import LogisticRegression

    def logit_pvalue(model, x):
        """ Calculate p-values for scikit-learn LogisticRegression.
        parameters:
            model: fitted sklearn.linear_model.LogisticRegression with intercept and large C
            x:     matrix on which the model was fit
        This function uses asymptotics for maximum likelihood estimates.
        """
        p = model.predict_proba(x)
        n = len(p)
        # coef_ has shape (1, n_features) for a binary model, so take row 0
        m = len(model.coef_[0]) + 1
        coefs = np.concatenate([model.intercept_, model.coef_[0]])
        # prepend a column of ones for the intercept
        x_full = np.insert(np.array(x), 0, 1, axis=1)
        # accumulate X' W X, the negative Hessian of the log-likelihood
        ans = np.zeros((m, m))
        for i in range(n):
            ans = ans + np.outer(x_full[i, :], x_full[i, :]) * p[i, 1] * p[i, 0]
        vcov = np.linalg.inv(ans)
        se = np.sqrt(np.diag(vcov))
        t = coefs / se
        # two-sided p-values from the standard normal distribution
        p = (1 - norm.cdf(abs(t))) * 2
        return p

    # test p-values
    x = np.arange(10)[:, np.newaxis]
    y = np.array([0,0,0,1,0,0,1,1,1,1])
    model = LogisticRegression(C=1e30).fit(x, y)
    print(logit_pvalue(model, x))

    # compare with statsmodels
    import statsmodels.api as sm
    sm_model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
    print(sm_model.pvalues)
    sm_model.summary()
The outputs of the two print() calls agree up to numerical precision, and they are the coefficient p-values (intercept first, then the slope):
    [ 0.11413093 0.08779978]
    [ 0.11413093 0.08779979]
sm_model.summary() also prints a nicely formatted HTML summary.
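With ~300K rows and ~600 columns, the row-by-row loop in the function above may be slow. Below is a sketch of a vectorized variant that builds the same X' W X matrix with a single matrix product. The name `logit_pvalue_vectorized` is hypothetical, and the sketch assumes a binary model fit with an intercept and effectively no regularization (very large C), as above.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression


def logit_pvalue_vectorized(model, x):
    """Wald p-values for a fitted binary LogisticRegression (intercept first).

    Assumes the model has an intercept and was fit with a very large C,
    so that the coefficients are close to the maximum likelihood estimates.
    """
    x = np.asarray(x, dtype=float)
    p1 = model.predict_proba(x)[:, 1]
    w = p1 * (1.0 - p1)                       # per-row weight p * (1 - p)
    x_full = np.column_stack([np.ones(len(x)), x])
    # Negative Hessian of the log-likelihood: X' diag(w) X
    info = x_full.T @ (x_full * w[:, None])
    vcov = np.linalg.inv(info)
    se = np.sqrt(np.diag(vcov))
    coefs = np.concatenate([model.intercept_, model.coef_[0]])
    z = coefs / se
    # two-sided p-values; norm.sf is the upper-tail probability 1 - cdf
    return 2 * norm.sf(np.abs(z))


# Same toy data as in the test above
x = np.arange(10)[:, np.newaxis]
y = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1, 1])
model = LogisticRegression(C=1e30).fit(x, y)
pvals = logit_pvalue_vectorized(model, x)
print(pvals)
```

On the toy data this should print values close to the ones shown above. Note that with many dummy variables the matrix can be singular or nearly so (for example, if every level of a categorical variable gets its own dummy alongside an intercept), in which case the inversion fails or yields unreliable standard errors.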