Scikit-learn predict_proba gives wrong answers

悲哀的现实 2020-12-02 11:28

This is a follow-up question to "How to know what classes are represented in return array from predict_proba in Scikit-learn".

In that question, I quoted the following:

3 Answers
  • 2020-12-02 11:43

    predict_proba uses the Platt scaling feature of libsvm to calibrate the probabilities; see:

    • How does sklearn.svm.svc's function predict_proba() work internally?

    So indeed the hyperplane predictions and the probability calibration can disagree, especially if you only have 2 samples in your dataset. It's weird that the internal cross-validation done by libsvm for scaling the probabilities does not fail (explicitly) in this case. Maybe this is a bug. One would have to dive into the Platt scaling code of libsvm to understand what's happening.
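
    For illustration, here is a minimal sketch (toy data, not from the original question) of how SVC's predict() and predict_proba() can disagree: predict() follows the decision function, while predict_proba() comes from the Platt scaling fitted by that internal cross-validation.

    import numpy as np
    from sklearn.svm import SVC

    # Tiny, hypothetical dataset
    X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
    y = np.array([0, 0, 0, 1, 1, 1])

    clf = SVC(kernel="linear", probability=True, random_state=0).fit(X, y)

    print(clf.predict(X))                       # labels from the decision function
    print(clf.predict_proba(X).argmax(axis=1))  # labels implied by the calibrated probabilities
    # With very few samples, these two outputs can differ.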

  • 2020-12-02 12:01

    Food for thought here. I think I actually got predict_proba to work as-is. Please see the code below...

    # Imports needed for this snippet
    import numpy as np
    import pandas as pd
    from sklearn import metrics, naive_bayes

    # Test data
    TX = [[1,2,3], [4,5,6], [7,8,9], [10,11,12], [13,14,15], [16,17,18], [19,20,21], [22,23,24]]
    TY = ['apple', 'orange', 'grape', 'kiwi', 'mango','peach','banana','pear']
    
    VX2 = [[16,17,18], [19,20,21], [22,23,24], [13,14,15], [10,11,12], [7,8,9], [4,5,6], [1,2,3]]
    VY2 = ['peach','banana','pear','mango', 'kiwi', 'grape', 'orange','apple']
    
    VX2_df = pd.DataFrame(data=VX2) # convert to dataframe
    VX2_df = VX2_df.rename(index=float, columns={0: "N0", 1: "N1", 2: "N2"})
    VY2_df = pd.DataFrame(data=VY2) # convert to dataframe
    VY2_df = VY2_df.rename(index=float, columns={0: "label"})
    
    # NEW - in testing
    def train_model(classifier, feature_vector_train, label, feature_vector_valid, valid_y, valid_x, is_neural_net=False):
    
        # fit the training dataset on the classifier
        classifier.fit(feature_vector_train, label)
    
        # predict the top n labels on validation dataset
        n = 5
        #classifier.probability = True
        probas = classifier.predict_proba(feature_vector_valid)
        predictions = classifier.predict(feature_vector_valid)
    
        #Identify the indexes of the top predictions
        #top_n_predictions = np.argsort(probas)[:,:-n-1:-1]
        top_n_predictions = np.argsort(probas, axis = 1)[:,-n:]
    
        #then find the associated SOC code for each prediction
        top_socs = classifier.classes_[top_n_predictions]
    
        #cast to a new dataframe
        top_n_df = pd.DataFrame(data=top_socs)
    
        #merge it up with the validation labels and descriptions
        results = pd.merge(valid_y, valid_x, left_index=True, right_index=True)
        results = pd.merge(results, top_n_df, left_index=True, right_index=True)
    
        conditions = [
            (results['label'] == results[0]),
            (results['label'] == results[1]),
            (results['label'] == results[2]),
            (results['label'] == results[3]),
            (results['label'] == results[4])]
        choices = [1, 1, 1, 1, 1]
        results['Successes'] = np.select(conditions, choices, default=0)
    
        print("Top 5 Accuracy Rate = ", sum(results['Successes'])/results.shape[0])
        print("Top 1 Accuracy Rate = ", metrics.accuracy_score(predictions, valid_y))
    
    train_model(naive_bayes.MultinomialNB(), TX, TY, VX2, VY2_df, VX2_df)
    

    Output:

        Top 5 Accuracy Rate = 1.0
        Top 1 Accuracy Rate = 1.0

    Couldn't get it to work for my own data though :(
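
    A stripped-down sketch of the same idea, on hypothetical toy data: the columns of predict_proba() line up with classifier.classes_, so an argsort over the probabilities can be mapped back to label names, and the highest-ranked label agrees with predict().

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical toy data
    X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    y = np.array(['apple', 'orange', 'grape'])

    clf = MultinomialNB().fit(X, y)
    probas = clf.predict_proba(X)

    top_2 = np.argsort(probas, axis=1)[:, -2:]   # column indices of the two largest probabilities
    print(clf.classes_[top_2])                   # map column indices back to label names
    print(clf.predict(X))                        # the highest-probability label matches predict()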

  • 2020-12-02 12:02

    If you use svm.LinearSVC() as the estimator, and .decision_function() (which is analogous to svm.SVC's .predict_proba()) to sort the results from the most probable class to the least probable one, the ranking agrees with the .predict() function. As a bonus, this estimator is faster and gives almost the same results as svm.SVC().

    The only drawback for you might be that .decision_function() gives a signed value, something like between -1 and 3, instead of a probability, but it is consistent with the prediction.
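
    As a rough sketch (toy data assumed), ranking the classes by their .decision_function() scores and mapping back through classes_ reproduces an ordering whose top class matches .predict():

    import numpy as np
    from sklearn.svm import LinearSVC

    # Hypothetical toy data with three classes
    X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0], [5.0, 1.0], [1.0, 5.0]])
    y = np.array(['a', 'a', 'b', 'b', 'c', 'c'])

    clf = LinearSVC().fit(X, y)

    scores = clf.decision_function(X)             # shape (n_samples, n_classes): signed margins, not probabilities
    ranking = np.argsort(scores, axis=1)[:, ::-1] # classes ordered from most to least likely
    print(clf.classes_[ranking])
    print(clf.predict(X))                         # matches the first column of the ranking above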
