This is a follow-up question from "How to know what classes are represented in return array from predict_proba in Scikit-learn".
In that question, I quoted the following answer:
predict_proba is using the Platt scaling feature of libsvm to calibrate probabilities, see:
So indeed the hyperplane predictions and the proba calibration can disagree, especially if you only have 2 samples in your dataset. It's weird that the internal cross validation done by libsvm for scaling the probabilities does not fail (explicitly) in this case. Maybe this is a bug. One would have to dive into the Platt scaling code of libsvm to understand what's happening.
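For example, here is a minimal sketch (toy data, not from the original question) of how you can check this yourself: fit an svm.SVC with probability=True and compare .predict() against the argmax of .predict_proba(). On very small datasets the two can disagree, because the probabilities come from a separately fitted Platt-scaling model.

import numpy as np
from sklearn import svm

X = [[1, 2], [2, 1], [8, 9], [9, 8], [1, 9], [9, 1]]
y = ['a', 'a', 'b', 'b', 'c', 'c']

clf = svm.SVC(probability=True, random_state=0)
clf.fit(X, y)

pred = clf.predict(X)                                                # hyperplane-based predictions
proba_pred = clf.classes_[np.argmax(clf.predict_proba(X), axis=1)]   # Platt-scaled ranking

print(pred)
print(proba_pred)
print(all(pred == proba_pred))   # may be False on tiny datasets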
Food for thought here. I think I actually got predict_proba to work as is. Please see the code below...
# imports needed for the snippets below
import numpy as np
import pandas as pd
from sklearn import metrics, naive_bayes

# Test data
TX = [[1,2,3], [4,5,6], [7,8,9], [10,11,12], [13,14,15], [16,17,18], [19,20,21], [22,23,24]]
TY = ['apple', 'orange', 'grape', 'kiwi', 'mango','peach','banana','pear']
VX2 = [[16,17,18], [19,20,21], [22,23,24], [13,14,15], [10,11,12], [7,8,9], [4,5,6], [1,2,3]]
VY2 = ['peach','banana','pear','mango', 'kiwi', 'grape', 'orange','apple']

VX2_df = pd.DataFrame(data=VX2) # convert to dataframe
VX2_df = VX2_df.rename(index=float, columns={0: "N0", 1: "N1", 2: "N2"})
VY2_df = pd.DataFrame(data=VY2) # convert to dataframe
VY2_df = VY2_df.rename(index=float, columns={0: "label"})
# NEW - in testing
def train_model(classifier, feature_vector_train, label, feature_vector_valid, valid_y, valid_x, is_neural_net=False):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)

    # predict the top n labels on the validation dataset
    n = 5
    #classifier.probability = True
    probas = classifier.predict_proba(feature_vector_valid)
    predictions = classifier.predict(feature_vector_valid)

    # identify the indexes of the top n predictions
    #top_n_predictions = np.argsort(probas)[:,:-n-1:-1]
    top_n_predictions = np.argsort(probas, axis=1)[:, -n:]

    # then find the associated SOC code for each prediction
    top_socs = classifier.classes_[top_n_predictions]

    # cast to a new dataframe
    top_n_df = pd.DataFrame(data=top_socs)

    # merge it up with the validation labels and descriptions
    results = pd.merge(valid_y, valid_x, left_index=True, right_index=True)
    results = pd.merge(results, top_n_df, left_index=True, right_index=True)

    # a row counts as a success if the true label appears anywhere in the top n
    conditions = [
        (results['label'] == results[0]),
        (results['label'] == results[1]),
        (results['label'] == results[2]),
        (results['label'] == results[3]),
        (results['label'] == results[4])]
    choices = [1, 1, 1, 1, 1]
    results['Successes'] = np.select(conditions, choices, default=0)

    print("Top 5 Accuracy Rate = ", sum(results['Successes'])/results.shape[0])
    print("Top 1 Accuracy Rate = ", metrics.accuracy_score(predictions, valid_y))
train_model(naive_bayes.MultinomialNB(), TX, TY, VX2, VY2_df, VX2_df)
Output:
Top 5 Accuracy Rate = 1.0
Top 1 Accuracy Rate = 1.0
Couldn't get it to work for my own data though :(
If you use svm.LinearSVC() as the estimator, you can use .decision_function() (which is like svm.SVC's .predict_proba()) to sort the results from the most probable class to the least probable one. This agrees with the .predict() function. Plus, this estimator is faster and gives almost the same results as svm.SVC(). The only drawback for you might be that .decision_function() gives a signed value, something like between -1 and 3, instead of a probability value, but it agrees with the prediction.
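For illustration, a minimal sketch (hypothetical data) of that approach: rank the classes by their .decision_function() scores. For LinearSVC, .predict() is the argmax of those scores, so the top-ranked class always matches the prediction.

import numpy as np
from sklearn import svm

X = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
y = ['apple', 'orange', 'grape', 'kiwi']

clf = svm.LinearSVC()
clf.fit(X, y)

scores = clf.decision_function(X)                            # shape (n_samples, n_classes); signed scores, not probabilities
ranked = clf.classes_[np.argsort(scores, axis=1)[:, ::-1]]   # classes ordered from highest to lowest score

print(ranked[:, 0])       # top-ranked class per sample
print(clf.predict(X))     # matches the line above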