Different result with roc_auc_score() and auc()

挽巷 2020-12-07 12:07

I have trouble understanding the difference (if there is one) between roc_auc_score() and auc() in scikit-learn.

I'm trying to predict a binary output.

3 Answers
  • 2020-12-07 12:39

    predict returns only one class or the other. When you compute a ROC curve from the results of predict on a classifier, there are only three thresholds (the trivial all-one-class, the trivial all-the-other-class, and one in between). Your ROC curve looks like this:

          ..............................
          |
          |
          |
    ......|
    |
    |
    |
    |
    |
    |
    |
    |
    |
    |
    |
    

    Meanwhile, predict_proba() returns an entire range of probabilities, so now you can put more than three thresholds on your data.

                 .......................
                 |
                 |
                 |
              ...|
              |
              |
         .....|
         |
         |
     ....|
    .|
    |
    |
    |
    |
    

    Hence different areas.
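
    A quick way to see this is to count the thresholds that roc_curve() actually finds. A minimal sketch (the classifier and random toy data here are made up for illustration):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve

    # Made-up toy data, purely for illustration
    X = np.random.rand(100, 2)
    y = np.random.randint(2, size=100)
    clf = LogisticRegression().fit(X, y)

    # Hard labels: only the trivial thresholds are available
    _, _, thr_from_labels = roc_curve(y, clf.predict(X))
    print(len(thr_from_labels))                # typically 3

    # Probabilities: one threshold per retained distinct score
    _, _, thr_from_probs = roc_curve(y, clf.predict_proba(X)[:, 1])
    print(len(thr_from_probs))                 # usually many more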

  • 2020-12-07 12:51

    AUC is not always the area under a ROC curve. "Area under the curve" is the (abstract) area under some curve, so it is a more general notion than AUROC. With imbalanced classes, it may be better to compute the AUC of a precision-recall curve.
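
    To illustrate, auc() can integrate a precision-recall curve just as well as a ROC curve. A minimal sketch with made-up imbalanced toy data (the variable names are hypothetical, not from the question):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve, roc_auc_score, auc

    # Made-up imbalanced toy data: roughly 10% positives
    rng = np.random.RandomState(0)
    X = rng.rand(500, 4)
    y = (rng.rand(500) < 0.1).astype(int)

    clf = LogisticRegression().fit(X, y)
    probs = clf.predict_proba(X)[:, 1]

    # Area under the ROC curve
    print(roc_auc_score(y, probs))

    # Area under the precision-recall curve, computed with the same auc() helper
    precision, recall, _ = precision_recall_curve(y, probs)
    print(auc(recall, precision))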

    See the sklearn source for roc_auc_score:

    def roc_auc_score(y_true, y_score, average="macro", sample_weight=None):
        # <...> docstring <...>
        def _binary_roc_auc_score(y_true, y_score, sample_weight=None):
            # <...> bla-bla <...>

            fpr, tpr, tresholds = roc_curve(y_true, y_score,
                                            sample_weight=sample_weight)
            return auc(fpr, tpr, reorder=True)

        return _average_binary_score(
            _binary_roc_auc_score, y_true, y_score, average,
            sample_weight=sample_weight)
    

    As you can see, it first gets a ROC curve and then calls auc() to get the area.

    I guess your problem is the predict_proba() call. For a normal predict() the outputs are always the same:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, auc, roc_auc_score

    est = LogisticRegression(class_weight='balanced')  # 'balanced' replaces the deprecated 'auto'
    X = np.random.rand(10, 2)
    y = np.random.randint(2, size=10)
    est.fit(X, y)

    false_positive_rate, true_positive_rate, thresholds = roc_curve(y, est.predict(X))
    print(auc(false_positive_rate, true_positive_rate))
    # 0.857142857143
    print(roc_auc_score(y, est.predict(X)))
    # 0.857142857143
    

    If you change the above to this, you'll sometimes get different outputs:

    false_positive_rate, true_positive_rate, thresholds = roc_curve(y, est.predict_proba(X)[:,1])
    # may differ
    print(auc(false_positive_rate, true_positive_rate))
    print(roc_auc_score(y, est.predict(X)))
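
    Passing the same probabilities to both functions should make them agree again (a short sketch continuing the same toy example):

    probs = est.predict_proba(X)[:, 1]
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y, probs)
    print(auc(false_positive_rate, true_positive_rate))
    print(roc_auc_score(y, probs))
    # both print the same value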
    
  • 2020-12-07 12:57

    When you use y_pred (class labels), you have already decided on the threshold. When you use y_prob (the positive-class probability), the threshold is still open, and the ROC curve should help you decide which threshold to use.

    In the first case, you are using the probabilities:

    y_probs = clf.predict_proba(xtest)[:,1]
    fp_rate, tp_rate, thresholds = roc_curve(y_true, y_probs)
    auc(fp_rate, tp_rate)
    

    When you do that, you're considering the AUC 'before' taking a decision on the threshold you'll be using.

    In the second case, you are using the predictions (not the probabilities); there, use 'predict' instead of 'predict_proba' for both calls and you should get the same result.

    y_pred = clf.predict(xtest)
    fp_rate, tp_rate, thresholds = roc_curve(y_true, y_pred)
    print(auc(fp_rate, tp_rate))
    # 0.857142857143

    print(roc_auc_score(y_true, y_pred))
    # 0.857142857143
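
    If you then want to use the curve to actually pick a threshold, one common illustrative choice is the point that maximizes tp_rate - fp_rate (Youden's J). A sketch reusing the clf / xtest / y_true names from above:

    import numpy as np

    y_probs = clf.predict_proba(xtest)[:, 1]
    fp_rate, tp_rate, thresholds = roc_curve(y_true, y_probs)

    # Youden's J: the threshold where tp_rate - fp_rate is largest
    best = np.argmax(tp_rate - fp_rate)
    chosen_threshold = thresholds[best]

    # Apply the chosen threshold to turn probabilities into class labels
    y_pred_tuned = (y_probs >= chosen_threshold).astype(int)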
    