sklearn LogisticRegression and changing the default threshold for classification

Submitted by 放荡痞女 on 2019-11-30 11:39:17

That is not a built-in feature. You can "add" it by wrapping the LogisticRegression class in your own class, and adding a threshold attribute which you use inside a custom predict() method.
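As a minimal sketch of that wrapping idea: the class name `ThresholdClassifier` and its `threshold` attribute are hypothetical, not part of scikit-learn, and the sketch assumes a binary problem where `classes_[1]` is the positive class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


class ThresholdClassifier:
    """Hypothetical wrapper: delegates to a fitted estimator, but predict()
    applies a custom probability threshold (binary problems only)."""

    def __init__(self, estimator, threshold=0.5):
        self.estimator = estimator
        self.threshold = threshold

    def fit(self, X, y):
        self.estimator.fit(X, y)
        return self

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)

    def predict(self, X):
        # Probability of the positive class, i.e. classes_[1]
        proba_pos = self.estimator.predict_proba(X)[:, 1]
        classes = self.estimator.classes_
        return np.where(proba_pos >= self.threshold, classes[1], classes[0])


# Illustrative data only; the 0.25 threshold is an arbitrary choice
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
model = ThresholdClassifier(LogisticRegression(max_iter=1000), threshold=0.25).fit(X, y)

preds_low = model.predict(X)                # custom threshold of 0.25
preds_default = model.estimator.predict(X)  # scikit-learn's default decision rule
print(preds_low.sum(), preds_default.sum())
```

Composition is used here instead of subclassing so the sketch does not depend on the exact `LogisticRegression.__init__` signature. Note that scikit-learn 1.5 and later also provide `FixedThresholdClassifier` and `TunedThresholdClassifierCV` in `sklearn.model_selection`, which may make a hand-written wrapper unnecessary.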

However, some cautions:

  1. The default threshold is actually 0. LogisticRegression.decision_function() returns a signed distance to the selected separation hyperplane. If you are looking at predict_proba(), then you are looking at the logistic (sigmoid) function of the hyperplane distance, with a threshold of 0.5. But that's more expensive to compute.
  2. By selecting the "optimal" threshold like this, you are utilizing information post-learning, which spoils your test set (i.e., your test or validation set no longer provides an unbiased estimate of out-of-sample error). You may therefore be inducing additional over-fitting unless you choose the threshold inside a cross-validation loop on your training set only, then use it and the trained classifier with your test set.
  3. Consider using class_weight if you have an unbalanced problem rather than manually setting the threshold. This should force the classifier to choose a hyperplane farther away from the class of serious interest.
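The relationship in caution 1 can be checked numerically. The sketch below assumes a binary problem with labels 0 and 1 and uses synthetic data; `scipy.special.expit` is the logistic (sigmoid) function and `scipy.special.logit` is its inverse.

```python
import numpy as np
from scipy.special import expit, logit
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, for illustration only
X, y = make_classification(n_samples=500, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

scores = clf.decision_function(X)    # signed score, default threshold 0
proba = clf.predict_proba(X)[:, 1]   # positive-class probability, default threshold 0.5

# predict_proba is the sigmoid of the decision score
same_proba = np.allclose(expit(scores), proba)

# predict() is the same as thresholding the score at 0
same_default = np.array_equal(clf.predict(X), (scores > 0).astype(int))

# A probability threshold p corresponds to a score threshold logit(p)
threshold = 0.25
same_custom = np.array_equal(scores > logit(threshold), proba > threshold)

print(same_proba, same_default, same_custom)
```

Thresholding `decision_function()` at `logit(p)` therefore gives the same labels as thresholding `predict_proba()` at `p`, without computing the sigmoid for every sample.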

I would like to give a practical answer

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score, roc_auc_score, precision_score

X, y = make_classification(
    n_classes=2, class_sep=1.5, weights=[0.9, 0.1],
    n_features=20, n_samples=1000, random_state=10
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

clf = LogisticRegression(class_weight="balanced")
clf.fit(X_train, y_train)
THRESHOLD = 0.25
preds = np.where(clf.predict_proba(X_test)[:,1] > THRESHOLD, 1, 0)

pd.DataFrame(data=[accuracy_score(y_test, preds), recall_score(y_test, preds),
                   precision_score(y_test, preds), roc_auc_score(y_test, preds)], 
             index=["accuracy", "recall", "precision", "roc_auc_score"])

Lowering THRESHOLD from the default 0.5 to 0.25 labels more samples as positive, so recall can only stay the same or rise, usually at the expense of precision. Removing the class_weight argument, on the other hand, increases accuracy but lowers the recall score. See the accepted answer for the caveats.
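To see how the metrics move with the threshold, here is a sketch that sweeps a few values on the same synthetic data. The threshold grid and `max_iter=1000` are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_classes=2, class_sep=1.5, weights=[0.9, 0.1],
    n_features=20, n_samples=1000, random_state=10
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

thresholds = [0.1, 0.25, 0.5, 0.75, 0.9]  # illustrative grid
recalls, precisions = [], []
for t in thresholds:
    preds = (proba > t).astype(int)
    recalls.append(recall_score(y_test, preds))
    # zero_division=0 avoids a warning if no sample is predicted positive
    precisions.append(precision_score(y_test, preds, zero_division=0))
    print(f"threshold={t:.2f}  recall={recalls[-1]:.3f}  precision={precisions[-1]:.3f}")
```

Because raising the threshold can only remove samples from the predicted-positive set, recall never increases as the threshold goes up; precision usually moves the other way, though it is not guaranteed to be monotonic.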

Special case: one-dimensional logistic regression

The value separating the regions where a sample X is labeled as 1 and where it is labeled 0 is calculated using the formula:

from scipy.special import logit
thresh = 0.1
val = (logit(thresh)-clf.intercept_)/clf.coef_[0]

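Thus, assuming the coefficient clf.coef_[0] is positive, the predictions can be calculated more directly with the comparison below (for a negative coefficient the inequality flips):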

preds = np.where(X>val, 1, 0)
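A small check of the one-dimensional shortcut, on synthetic data generated only for illustration. The names `w`, `b`, `preds_direct` and `preds_proba` are introduced here and are not from the original answer.

```python
import numpy as np
from scipy.special import logit
from sklearn.linear_model import LogisticRegression

# One feature; the label depends on the feature plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)

thresh = 0.1
w = clf.coef_[0, 0]    # single slope
b = clf.intercept_[0]  # intercept
val = (logit(thresh) - b) / w  # feature value where P(y=1) equals thresh

# The direction of the comparison depends on the sign of the slope
if w > 0:
    preds_direct = (X[:, 0] > val).astype(int)
else:
    preds_direct = (X[:, 0] < val).astype(int)

preds_proba = (clf.predict_proba(X)[:, 1] > thresh).astype(int)
agree = np.array_equal(preds_direct, preds_proba)
print(agree)
```

The cut-off follows from solving sigmoid(w * x + b) = thresh for x, which gives x = (logit(thresh) - b) / w.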

For the sake of completeness, I would like to mention another way to elegantly generate predictions based on scikit's probability computations using binarize:

import numpy as np
from sklearn.preprocessing import binarize

THRESHOLD = 0.25

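# These probabilities would come from logistic_regression.predict_proba()
y_logistic_prob = np.random.uniform(size=10)

# binarize expects a 2D array; threshold must be passed by keyword in recent scikit-learn versions
predictions = binarize(y_logistic_prob.reshape(-1, 1), threshold=THRESHOLD).ravel()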

Furthermore, I agree with the considerations that Andreus makes, especially 2 and 3. Be sure to keep an eye on them.
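One way to respect caution 2 is to pick the threshold from out-of-fold predictions on the training set and touch the test set only once at the end. The sketch below assumes F1 as the selection metric and a simple grid of candidate thresholds; both are illustrative choices, not something the answers above prescribe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_classification(
    n_classes=2, class_sep=1.5, weights=[0.9, 0.1],
    n_features=20, n_samples=1000, random_state=10
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000)

# Out-of-fold probabilities: each training sample is scored by a model
# that was fitted without it
oof_proba = cross_val_predict(
    clf, X_train, y_train, cv=5, method="predict_proba"
)[:, 1]

# Choose the threshold using training data only (assumed metric: F1)
candidates = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_train, (oof_proba > t).astype(int)) for t in candidates]
best_threshold = float(candidates[int(np.argmax(f1s))])

# Refit on the full training set, then evaluate once on the held-out test set
clf.fit(X_train, y_train)
test_preds = (clf.predict_proba(X_test)[:, 1] > best_threshold).astype(int)
test_f1 = f1_score(y_test, test_preds)
print(best_threshold, test_f1)
```

Since the test set plays no part in choosing `best_threshold`, the final score remains an honest estimate of out-of-sample performance.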
