Improving the prediction score by use of confidence level of classifiers on instances

Question

I am using three classifiers (RandomForestClassifier, KNeighborsClassifier, and SVC), shown below:

>> svm_clf_sl_GS
SVC(C=5, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovo', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=True, random_state=41, shrinking=True,
  tol=0.001, verbose=False)

>> knn_clf_sl_GS
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='distance')

>> for_clf_sl_GS
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

During training, RandomForestClassifier gives the best f1_score, followed by KNeighborsClassifier and then SVC. Here are my X_train (standard-scaled values; I can explain how I got them if needed) and y_train:

>> X_train
array([[-0.11034393, -0.72380296,  0.15254572, ...,  0.4166148 ,
        -0.91095473, -0.91095295],
       [ 1.6817184 ,  0.40040944, -0.6770607 , ..., -0.2403781 ,
         0.02962478,  0.02962424],
       [ 1.01128052, -0.21062032, -0.2460462 , ..., -0.04817728,
        -0.15848331, -0.15847739],
       ..., 
       [-1.18666853,  0.87297522,  0.47136779, ..., -0.19599824,
         0.72417473,  0.72416714],
       [ 1.6835304 ,  0.40605067, -0.63383059, ..., -0.37094083,
         0.09505496,  0.09505389],
       [ 0.19950709, -1.04624152, -0.18351693, ...,  0.4362658 ,
        -0.77994791, -0.77994176]])

>> y_train_sl
874     0
1863    0
1493    0
288     1
260     0
495     0
1529    0
1704    1
75      1
1792    0
626     0
99      1
222     0
774     0
52      1
1688    1
1770    0
53      1
1814    0
488     0
230     0
481     0
132     1
831     0
1166    1
1593    0
771     0
1785    0
616     0
207     0
       ..
155     1
1506    0
719     0
547     0
613     0
652     0
1351    0
304     0
1689    1
1693    1
1128    0
1323    0
763     0
701     0
467     0
917     0
329     0
375     0
1721    0
928     0
1784    0
1200    0
832     0
986     0
1687    1
643     0
802     0
280     1
1864    0
1045    0
Name: Type of Formation_shaly limestone, Length: 1390, dtype: uint8

As you can see, my y_train is in Boolean form (i.e. it marks which instances are True and which are False).

I want to improve the predictions further using predict_proba. When the first classifier (say RandomForestClassifier) has a low confidence level (<60%) on particular instances, which I first need to identify, I want to move to the next classifier (say KNeighborsClassifier) and check its confidence level on those same instances. If that classifier has a high confidence level (>60%), I accept its prediction instead. If it is also below 60% on those instances, I move on to the third classifier and do the same thing.

Finally, if the third classifier also has a low confidence level (<60%), I need to accept the prediction from whichever of the three classifiers has the highest confidence level.

Since I am new to machine learning, some of my statements may be confusing, for which I apologize; please correct me where I am wrong.

EDIT: X_test and y_test are shown below. I need to predict on X_test_prepared and evaluate the predictions against y_test_sl using f1_score. The predicted y must have passed through all three classifiers and have the best confidence level for every instance.

>> X_test_prepared
array([[ 0.69961751, -0.11156033, -0.43852312, ..., -0.40967982,
         0.32099948,  0.32099952],
       [ 0.90256086, -0.54532856, -0.46399801, ..., -0.05752097,
        -0.54261829, -0.54261947],
       [ 1.67447042,  0.24530384, -1.0113221 , ..., -0.54844942,
        -0.26066608, -0.26066032],
       ...,
       [ 0.28104683,  1.52670909,  0.62653301, ..., -1.15596295,
         2.05859487,  2.05859247],
       [ 1.50595496,  0.84507934, -0.44109634, ..., -0.71277072,
         0.14474518,  0.14474398],
       [-1.63423112, -0.12690448,  0.48577783, ..., -0.36025459,
         0.29137477,  0.29137047]])

>> y_test_sl
1321    0
1433    0
1859    0
1496    0
492     0
736     0
996     0
1001    0
634     0
1486    0
910     0
1579    0
373     0
1750    0
1563    0
1584    0
51      1
349     0
1162    1
594     0
1121    0
1637    0
1116    0
106     1
1533    0
993     0
960     0
277     0
142     1
1010    0
       ..
1104    1
1404    0
1646    0
1009    0
61      1
444     0
10      1
704     0
744     0
418     0
998     0
740     0
465     0
97      1
1550    1
1738    0
978     0
690     0
1071    0
1228    1
1539    0
145     1
1015    0
1371    0
1758    0
315     0
71      1
1090    0
1766    0
33      1
Name: Type of Formation_shaly limestone, Length: 515, dtype: uint8

Answer 1:

The goal here turned out to be creating an ensemble of classifiers and taking the most "confident" prediction (the class with the highest probability) across all classifiers. The code is below:

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import numpy as np
from sklearn.datasets import make_classification

X_train, y_train = make_classification(n_features=4) # Put your training data here instead

# parameters for random forest
rfclf_params = {
    'bootstrap': True, 
    'class_weight':None, 
    'criterion':'entropy',
    'max_depth':None, 
    'max_features':'sqrt',  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
    # ... fill in the rest you want here
}

# Fill in svm params here
svm_params = {
    'probability':True
}

# KNeighbors params go here
kneighbors_params = {

}

params = [rfclf_params, svm_params, kneighbors_params]
classifiers = [RandomForestClassifier, SVC, KNeighborsClassifier]

def ensemble(classifiers, params, X_train, y_train, X_test):
    classes = np.unique(y_train)
    # One column per class; holds the running element-wise maximum over all classifiers
    best_preds = np.zeros((len(X_test), len(classes)))

    for i in range(len(classifiers)):
        # Construct the classifier by unpacking params 
        # store classifier instance
        clf = classifiers[i](**params[i])
        # Fit the classifier as usual and call predict_proba
        clf.fit(X_train, y_train)
        y_preds = clf.predict_proba(X_test)
        # Take maximum probability for each class on each classifier 
        # This is done for every instance in X_test
        # see the docs of np.maximum here: 
        # https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.maximum.html
        best_preds = np.maximum(best_preds, y_preds)

    # map the maximum probability for each instance back to its corresponding class
    preds = np.array([classes[np.argmax(pred)] for pred in best_preds])
    return preds

# Sanity check on the training data; for a real evaluation, pass held-out data as X_test
from sklearn.metrics import accuracy_score, f1_score
y_preds = ensemble(classifiers, params, X_train, y_train, X_train)
print(accuracy_score(y_train, y_preds), f1_score(y_train, y_preds))
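
To make the np.maximum step above concrete, here is a small toy example. The probability values are made up for illustration and stand in for the predict_proba output of two classifiers on three instances; proba_a and proba_b are hypothetical names, not part of the code above.

```python
import numpy as np

# Made-up predict_proba outputs: 3 instances, 2 classes (columns = class 0, class 1).
proba_a = np.array([[0.55, 0.45], [0.90, 0.10], [0.40, 0.60]])
proba_b = np.array([[0.20, 0.80], [0.70, 0.30], [0.45, 0.55]])

# Element-wise maximum keeps, per instance and per class, the largest
# probability that any classifier assigned.
best = np.maximum(proba_a, proba_b)

# argmax over the class axis picks the class behind the single most
# confident prediction; max over the same axis is that confidence.
labels = best.argmax(axis=1)
confidence = best.max(axis=1)

print(labels)      # [1 0 1]
print(confidence)  # [0.8 0.9 0.6]
```

Note that for the first instance the second classifier wins (0.80 for class 1) even though the first classifier leaned slightly towards class 0.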

If you want the algorithm to return the highest probabilities instead of the predicted class, have ensemble return [np.amax(pred_probs) for pred_probs in best_preds] rather than preds.
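
The question also describes a sequential fallback with a 60% threshold, which is not quite the same as always taking the overall maximum. A minimal sketch of that variant follows, under some assumptions: cascade_predict is a hypothetical helper (not part of scikit-learn), it takes the predict_proba arrays in priority order, it treats a confidence of exactly 60% as "confident", and the probabilities in the demo are made up.

```python
import numpy as np

def cascade_predict(probas, classes, threshold=0.6):
    """Pick a label per instance from an ordered list of predict_proba arrays.

    probas: list of arrays of shape (n_samples, n_classes), in priority order.
    classes: class labels matching the columns (e.g. clf.classes_).
    Returns (labels, confidence) for the classifier chosen per instance.
    """
    stacked = np.stack(probas)          # (n_classifiers, n_samples, n_classes)
    conf = stacked.max(axis=2)          # each classifier's confidence per instance
    label_idx = stacked.argmax(axis=2)  # each classifier's predicted column

    confident = conf >= threshold
    # First classifier in priority order that clears the threshold
    # (argmax on booleans returns the first True, or 0 if there is none).
    first_ok = confident.argmax(axis=0)
    # Fallback when no classifier is confident: the most confident one overall.
    most_conf = conf.argmax(axis=0)
    chosen = np.where(confident.any(axis=0), first_ok, most_conf)

    cols = np.arange(stacked.shape[1])
    return np.asarray(classes)[label_idx[chosen, cols]], conf[chosen, cols]

# Made-up probabilities for three classifiers on three instances.
rf = np.array([[0.90, 0.10], [0.55, 0.45], [0.50, 0.50]])
knn = np.array([[0.20, 0.80], [0.30, 0.70], [0.45, 0.55]])
svm = np.array([[0.40, 0.60], [0.10, 0.90], [0.58, 0.42]])

labels, confidence = cascade_predict([rf, knn, svm], classes=np.array([0, 1]))
print(labels)      # [0 1 0]
print(confidence)  # [0.9 0.7 0.58]
```

With the fitted classifiers from the question, the input list would presumably be built as [clf.predict_proba(X_test_prepared) for clf in (for_clf_sl_GS, knn_clf_sl_GS, svm_clf_sl_GS)], and the returned labels can be passed to f1_score together with y_test_sl. This assumes all three classifiers were fitted on the same labels, so their probability columns are in the same class order.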

Source: https://stackoverflow.com/questions/49396961/improving-the-prediction-score-by-use-of-confidence-level-of-classifiers-on-inst

Tags

python

machine-learning

boolean

text-classification