Different result with roc_auc_score() and auc()

挽巷 2020-12-07 12:07

I have trouble understanding the difference (if there is one) between roc_auc_score() and auc() in scikit-learn.

I'm trying to predict a binary output.

3 Answers
  • 2020-12-07 12:39

    predict returns only one class or the other. When you compute a ROC curve from the results of predict on a classifier, there are only three thresholds (the trivial all-one-class, the trivial all-the-other-class, and one in between). Your ROC curve looks like this:

          ..............................
          |
          |
          |
    ......|
    |
    |
    |
    |
    |
    |
    |
    |
    |
    |
    |
    

    Meanwhile, predict_proba() returns an entire range of probabilities, so now you can put more than three thresholds on your data.

                 .......................
                 |
                 |
                 |
              ...|
              |
              |
         .....|
         |
         |
     ....|
    .|
    |
    |
    |
    |
    

    Hence different areas.
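
    A quick way to see this is to count the thresholds that roc_curve() actually finds. A minimal sketch (the classifier and random toy data here are made up for illustration):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve

    # Made-up toy data, purely for illustration
    X = np.random.rand(100, 2)
    y = np.random.randint(2, size=100)
    clf = LogisticRegression().fit(X, y)

    # Hard labels: only the trivial thresholds are available
    _, _, thr_from_labels = roc_curve(y, clf.predict(X))
    print(len(thr_from_labels))                # typically 3

    # Probabilities: one threshold per retained distinct score
    _, _, thr_from_probs = roc_curve(y, clf.predict_proba(X)[:, 1])
    print(len(thr_from_probs))                 # usually many more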

  • 2020-12-07 12:51

    AUC is not always the area under a ROC curve. "Area under the curve" is the (abstract) area under some curve, so it is a more general notion than AUROC. With imbalanced classes, it may be better to compute the AUC of a precision-recall curve.
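
    To illustrate, auc() can integrate a precision-recall curve just as well as a ROC curve. A minimal sketch with made-up imbalanced toy data (the variable names are hypothetical, not from the question):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve, roc_auc_score, auc

    # Made-up imbalanced toy data: roughly 10% positives
    rng = np.random.RandomState(0)
    X = rng.rand(500, 4)
    y = (rng.rand(500) < 0.1).astype(int)

    clf = LogisticRegression().fit(X, y)
    probs = clf.predict_proba(X)[:, 1]

    # Area under the ROC curve
    print(roc_auc_score(y, probs))

    # Area under the precision-recall curve, computed with the same auc() helper
    precision, recall, _ = precision_recall_curve(y, probs)
    print(auc(recall, precision))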

    See the sklearn source for roc_auc_score:

    def roc_auc_score(y_true, y_score, average="macro", sample_weight=None):
        # <...> docstring <...>
        def _binary_roc_auc_score(y_true, y_score, sample_weight=None):
            # <...> bla-bla <...>

            fpr, tpr, tresholds = roc_curve(y_true, y_score,
                                            sample_weight=sample_weight)
            return auc(fpr, tpr, reorder=True)

        return _average_binary_score(
            _binary_roc_auc_score, y_true, y_score, average,
            sample_weight=sample_weight)
    

    As you can see, it first gets a ROC curve and then calls auc() to get the area.

    I guess your problem is the predict_proba() call. For a normal predict() the outputs are always the same:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, auc, roc_auc_score

    est = LogisticRegression(class_weight='balanced')  # 'balanced' replaces the deprecated 'auto'
    X = np.random.rand(10, 2)
    y = np.random.randint(2, size=10)
    est.fit(X, y)

    false_positive_rate, true_positive_rate, thresholds = roc_curve(y, est.predict(X))
    print(auc(false_positive_rate, true_positive_rate))
    # 0.857142857143
    print(roc_auc_score(y, est.predict(X)))
    # 0.857142857143
    

    If you change the above to this, you'll sometimes get different outputs:

    false_positive_rate, true_positive_rate, thresholds = roc_curve(y, est.predict_proba(X)[:,1])
    # may differ
    print(auc(false_positive_rate, true_positive_rate))
    print(roc_auc_score(y, est.predict(X)))
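
    Passing the same probabilities to both functions should make them agree again (a short sketch continuing the same toy example):

    probs = est.predict_proba(X)[:, 1]
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y, probs)
    print(auc(false_positive_rate, true_positive_rate))
    print(roc_auc_score(y, probs))
    # both print the same value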
    
  • 2020-12-07 12:57

    When you use y_pred (class labels), you have already decided on the threshold. When you use y_prob (the positive-class probability), the threshold is still open, and the ROC curve should help you decide which threshold to use.

    In the first case, you are using the probabilities:

    y_probs = clf.predict_proba(xtest)[:,1]
    fp_rate, tp_rate, thresholds = roc_curve(y_true, y_probs)
    auc(fp_rate, tp_rate)
    

    When you do that, you're considering the AUC 'before' taking a decision on the threshold you'll be using.

    In the second case, you are using the predictions (not the probabilities); there, use 'predict' instead of 'predict_proba' for both calls and you should get the same result.

    y_pred = clf.predict(xtest)
    fp_rate, tp_rate, thresholds = roc_curve(y_true, y_pred)
    print(auc(fp_rate, tp_rate))
    # 0.857142857143

    print(roc_auc_score(y_true, y_pred))
    # 0.857142857143
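
    If you then want to use the curve to actually pick a threshold, one common illustrative choice is the point that maximizes tp_rate - fp_rate (Youden's J). A sketch reusing the clf / xtest / y_true names from above:

    import numpy as np

    y_probs = clf.predict_proba(xtest)[:, 1]
    fp_rate, tp_rate, thresholds = roc_curve(y_true, y_probs)

    # Youden's J: the threshold where tp_rate - fp_rate is largest
    best = np.argmax(tp_rate - fp_rate)
    chosen_threshold = thresholds[best]

    # Apply the chosen threshold to turn probabilities into class labels
    y_pred_tuned = (y_probs >= chosen_threshold).astype(int)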
    