sklearn precision_recall_curve and threshold

走远了吗. Submitted on 2021-01-29 17:28:59


I was wondering how sklearn decides how many thresholds to use in precision_recall_curve. There is another post on this: "How does sklearn select threshold steps in precision recall curve?". It points to the source code, where I found this example:

import numpy as np
from sklearn.metrics import precision_recall_curve
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

which then gives

>>> precision
    array([0.66666667, 0.5       , 1.        , 1.        ])
>>> recall
    array([1. , 0.5, 0.5, 0. ])
>>> thresholds
    array([0.35, 0.4 , 0.8 ])

Could someone explain to me how to get those recalls and precisions by showing me what is computed?


I know I am a bit late here, but I had a similar question, and the link you provided cleared it up. Roughly speaking, here is what happens inside precision_recall_curve(), following the sklearn implementation.

  1. The decision scores are sorted in descending order, and the labels are reordered to match:

    desc_score_indices = np.argsort(y_scores, kind="mergesort")[::-1]
    y_scores = y_scores[desc_score_indices]
    y_true = y_true[desc_score_indices]

    You'll get:

    y_scores, y_true
    (array([0.8 , 0.4 , 0.35, 0.1 ]), array([1, 0, 1, 0]))
  2. The sklearn implementation then keeps a single index per distinct value of y_scores, namely the last position in each run of tied scores (there are no duplicates in this example).

    distinct_value_indices = np.where(np.diff(y_scores))[0]
    threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]

    Since there are no duplicates, every index is kept:

    distinct_value_indices, threshold_idxs 
    (array([0, 1, 2], dtype=int64), array([0, 1, 2, 3], dtype=int64))
  3. Next, count the true positives and false positives at each candidate threshold; precision and recall follow from these counts.

    # tps[i] is the number of positive samples with a score >= thresholds[i]
    tps = np.cumsum(y_true)[threshold_idxs]
    # fps[i] is the number of negative samples with a score >= thresholds[i];
    # sklearn computes it equivalently as fps = 1 + threshold_idxs - tps
    fps = np.cumsum(1 - y_true)[threshold_idxs]
    # keep only the scores at the selected indices: these are the candidate thresholds
    y_scores = y_scores[threshold_idxs]

    After these steps you'll have two arrays holding the number of true positives and false positives for each candidate threshold:

    tps, fps
    (array([1, 1, 2, 2], dtype=int32), array([0, 1, 1, 2], dtype=int32))
  4. From these counts, compute precision and recall.

    precision = tps / (tps + fps)
    # tps[-1] being the total number of positive samples
    recall = tps / tps[-1]
    precision, recall
    (array([1.        , 0.5       , 0.66666667, 0.5       ]), array([0.5, 0.5, 1. , 1. ]))

    An important point, raised in the post you linked, explains why the thresholds array is shorter than y_scores even though y_scores contains no duplicates. The output is cut at the first index where recall reaches 1, because lowering the threshold any further cannot increase recall. Here that index is 2, so only indices 0 to 2 are kept, which gives 3 thresholds. The arrays are also reversed, so recall is returned in decreasing order and thresholds in increasing order.

    last_ind = tps.searchsorted(tps[-1])   # 2, the first index where tps equals the total number of positives
    sl = slice(last_ind, None, -1)         # from index 2 down to 0
    precision, recall, thresholds = np.r_[precision[sl], 1], np.r_[recall[sl], 0], y_scores[sl]
    precision, recall, thresholds
    (array([0.66666667, 0.5       , 1.        , 1.        ]),
     array([1. , 0.5, 0.5, 0. ]), array([0.35, 0.4 , 0.8 ]))

    One last point: precision and recall have length 4 because a final point with precision 1 and recall 0 is appended to the arrays, so that the precision-recall curve starts on the y-axis. This extra point has no corresponding threshold, which is why precision and recall are one element longer than thresholds.
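
The steps above can be collected into a single function and checked against the plain definitions of precision and recall. This is only a minimal sketch: it assumes binary 0/1 labels, at least one positive sample, and no sample weights, and `pr_curve_sketch` is a hypothetical name, not part of sklearn. It mirrors the walk-through in this answer rather than the library's current code. Later scikit-learn releases may no longer cut the output at full recall, so `precision_recall_curve` on a newer installation may return an extra leading point.

```python
import numpy as np


def pr_curve_sketch(y_true, y_scores):
    """Hypothetical helper mirroring the steps described in this answer."""
    y_true = np.asarray(y_true)
    y_scores = np.asarray(y_scores, dtype=float)

    # Step 1: sort scores in descending order and reorder labels to match.
    order = np.argsort(y_scores, kind="mergesort")[::-1]
    y_scores, y_true = y_scores[order], y_true[order]

    # Step 2: keep the last index of each run of tied scores.
    distinct = np.where(np.diff(y_scores))[0]
    idx = np.r_[distinct, y_true.size - 1]

    # Step 3: cumulative true and false positives at each candidate threshold.
    tps = np.cumsum(y_true)[idx]
    fps = 1 + idx - tps

    # Step 4: precision and recall (assumes tps[-1] > 0, i.e. some positives).
    precision = tps / (tps + fps)
    recall = tps / tps[-1]

    # Cut at the first index with full recall, reverse, then append (1, 0).
    last = tps.searchsorted(tps[-1])
    sl = slice(last, None, -1)
    return np.r_[precision[sl], 1], np.r_[recall[sl], 0], y_scores[idx][sl]


y = np.array([0, 0, 1, 1])
s = np.array([0.1, 0.4, 0.35, 0.8])
precision, recall, thresholds = pr_curve_sketch(y, s)

# Cross-check against the definitions: predict positive when score >= t.
# zip stops at the shortest array, so the appended (1, 0) point is skipped.
for t, p, r in zip(thresholds, precision, recall):
    pred = s >= t
    tp = np.sum(pred & (y == 1))
    assert np.isclose(p, tp / pred.sum())      # precision = TP / predicted positives
    assert np.isclose(r, tp / (y == 1).sum())  # recall = TP / actual positives
```

For example, at the threshold 0.35 three samples are predicted positive (scores 0.35, 0.4, 0.8) and two of them are true positives, which gives precision 2/3 and recall 2/2 = 1, the first entries of the returned arrays.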

