Controlling the threshold in Logistic Regression in Scikit Learn

自闭症患者 2020-12-13 09:57

I am using the LogisticRegression() method in scikit-learn on a highly imbalanced data set. I have even turned the class_weight feature …

3 Answers
  • 2020-12-13 10:12

    Yes, scikit-learn uses a threshold of P > 0.5 for binary classification. I am going to build on some of the answers already posted with two options to check this:

    One simple option is to extract the probabilities of each classification from the output of model.predict_proba(test_x), along with the class predictions from model.predict(test_x). Then append the class predictions and their probabilities to your test dataframe as a check.
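
    A minimal sketch of that check, using a toy imbalanced dataset in place of the asker's data (the names log, test_x, and test_y are assumptions matching the snippet below):

```python
# Sketch only: a toy imbalanced dataset stands in for the asker's data.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=0)

log = LogisticRegression(max_iter=1000).fit(train_x, train_y)

# Append class predictions and their probabilities to the test dataframe
check = pd.DataFrame(test_x)
check["true_y"] = test_y
check["pred_y"] = log.predict(test_x)
# predict_proba returns one column per class; column 1 is P(y = 1)
check["prob_1"] = log.predict_proba(test_x)[:, 1]

print(check[["true_y", "pred_y", "prob_1"]].head())
```

    Scanning the prob_1 column next to pred_y makes the implicit 0.5 cutoff visible row by row.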

    As another option, one can graphically view precision vs. recall at various thresholds using the following code.

    # Predict test_y values and probabilities based on the fitted
    # logistic regression model
    import matplotlib.pyplot as plt
    from sklearn import metrics
    from sklearn.metrics import precision_recall_curve

    pred_y = log.predict(test_x)

    # probs_y is a 2-D array: column 0 holds the probability of being
    # labeled 0, column 1 the probability of being labeled 1
    probs_y = log.predict_proba(test_x)

    # retrieve the probability of being 1 (second column of probs_y)
    precision, recall, thresholds = precision_recall_curve(test_y, probs_y[:, 1])
    pr_auc = metrics.auc(recall, precision)

    plt.title("Precision-Recall vs Threshold Chart")
    plt.plot(thresholds, precision[:-1], "b--", label="Precision")
    plt.plot(thresholds, recall[:-1], "r--", label="Recall")
    plt.ylabel("Precision, Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="lower left")
    plt.ylim([0, 1])
    plt.show()
    
  • 2020-12-13 10:12

    There is a little trick that I use: instead of model.predict(test_data), use model.predict_proba(test_data). Then sweep a range of threshold values and analyze their effect on the predictions:

    import pandas as pd
    from sklearn import metrics
    from sklearn.metrics import confusion_matrix

    pred_proba_df = pd.DataFrame(model.predict_proba(x_test))
    threshold_list = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5,
                      0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.99]
    for i in threshold_list:
        print('\n******** For i = {} ******'.format(i))
        # label as 1 wherever the probability exceeds the threshold
        Y_test_pred = pred_proba_df.applymap(lambda x: 1 if x > i else 0)
        test_accuracy = metrics.accuracy_score(Y_test, Y_test_pred.iloc[:, 1])
        print('Our testing accuracy is {}'.format(test_accuracy))

        print(confusion_matrix(Y_test, Y_test_pred.iloc[:, 1]))
    

    Best!

  • 2020-12-13 10:14

    Logistic regression chooses the class with the highest probability. With two classes, the threshold is 0.5: if P(Y=0) > 0.5, then obviously P(Y=0) > P(Y=1). The same holds in the multiclass setting: again, it chooses the class with the highest probability (see e.g. Ng's lectures, the bottom lines).

    Introducing special thresholds only affects the proportion of false positives to false negatives (and thus the precision/recall tradeoff), but it is not a parameter of the LR model. See also the similar question.
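
    Applying a custom threshold therefore happens outside the model, on the predict_proba output. A minimal sketch on toy data (the names model, probs, and the 0.3 cutoff are illustrative, not from the question):

```python
# Sketch only: toy data and an illustrative 0.3 cutoff.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

probs = model.predict_proba(X)[:, 1]  # P(y = 1)

default_pred = model.predict(X)            # implicit 0.5 threshold
custom_pred = (probs >= 0.3).astype(int)   # lower cutoff -> more positives

print(default_pred.sum(), custom_pred.sum())
```

    Lowering the cutoff can only add positive predictions, never remove them, which is exactly the false-positive/false-negative tradeoff described above; the fitted coefficients of the model are untouched.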
