Question
I am using three classifiers (RandomForestClassifier, KNeighborsClassifier, and SVC), which you can see below:
>> svm_clf_sl_GS
SVC(C=5, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovo', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=True, random_state=41, shrinking=True,
tol=0.001, verbose=False)
>> knn_clf_sl_GS
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=3, p=2,
weights='distance')
>> for_clf_sl_GS
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
During training, RandomForestClassifier gives the best f1_score, followed by KNeighborsClassifier and then SVC, on the predictions from the data. Here are my X_train (standard-scaled values; I can explain how I obtained them if needed) and y_train:
>> X_train
array([[-0.11034393, -0.72380296, 0.15254572, ..., 0.4166148 ,
-0.91095473, -0.91095295],
[ 1.6817184 , 0.40040944, -0.6770607 , ..., -0.2403781 ,
0.02962478, 0.02962424],
[ 1.01128052, -0.21062032, -0.2460462 , ..., -0.04817728,
-0.15848331, -0.15847739],
...,
[-1.18666853, 0.87297522, 0.47136779, ..., -0.19599824,
0.72417473, 0.72416714],
[ 1.6835304 , 0.40605067, -0.63383059, ..., -0.37094083,
0.09505496, 0.09505389],
[ 0.19950709, -1.04624152, -0.18351693, ..., 0.4362658 ,
-0.77994791, -0.77994176]])
>> y_train_sl
874 0
1863 0
1493 0
288 1
260 0
495 0
1529 0
1704 1
75 1
1792 0
626 0
99 1
222 0
774 0
52 1
1688 1
1770 0
53 1
1814 0
488 0
230 0
481 0
132 1
831 0
1166 1
1593 0
771 0
1785 0
616 0
207 0
..
155 1
1506 0
719 0
547 0
613 0
652 0
1351 0
304 0
1689 1
1693 1
1128 0
1323 0
763 0
701 0
467 0
917 0
329 0
375 0
1721 0
928 0
1784 0
1200 0
832 0
986 0
1687 1
643 0
802 0
280 1
1864 0
1045 0
Name: Type of Formation_shaly limestone, Length: 1390, dtype: uint8
As you can see, my y_train is in Boolean form (i.e. it marks which instances are True and which are False).
I want to improve the accuracy of the predictions further by using predict_proba in the following way. First, I need to find the instances for which the first classifier (let's say RandomForestClassifier) has a low confidence level (<60%) in its prediction. For those instances, I move to the next classifier (let's say KNeighborsClassifier) and check its confidence level on the same instances. If it has a high confidence level (>60%), I accept the prediction from that classifier instead. If it still has a low confidence level (<60%), I move to the third classifier and do the same thing.
Finally, if the third classifier also has a low confidence level (<60%), I need to accept the prediction from whichever of the three classifiers has the highest confidence level.
Since I am new to Machine Learning, I might be confusing you with some of these statements, for which I apologize. Please correct me where I am wrong.
EDIT:
X_test and y_test are shown below. I need to predict on X_test_prepared and evaluate the predictions against y_test_sl using f1_score. Each predicted y must have passed through all three classifiers and carry the best confidence level for its instance.
>> X_test_prepared
array([[ 0.69961751, -0.11156033, -0.43852312, ..., -0.40967982,
0.32099948, 0.32099952],
[ 0.90256086, -0.54532856, -0.46399801, ..., -0.05752097,
-0.54261829, -0.54261947],
[ 1.67447042, 0.24530384, -1.0113221 , ..., -0.54844942,
-0.26066608, -0.26066032],
...,
[ 0.28104683, 1.52670909, 0.62653301, ..., -1.15596295,
2.05859487, 2.05859247],
[ 1.50595496, 0.84507934, -0.44109634, ..., -0.71277072,
0.14474518, 0.14474398],
[-1.63423112, -0.12690448, 0.48577783, ..., -0.36025459,
0.29137477, 0.29137047]])
>> y_test_sl
1321 0
1433 0
1859 0
1496 0
492 0
736 0
996 0
1001 0
634 0
1486 0
910 0
1579 0
373 0
1750 0
1563 0
1584 0
51 1
349 0
1162 1
594 0
1121 0
1637 0
1116 0
106 1
1533 0
993 0
960 0
277 0
142 1
1010 0
..
1104 1
1404 0
1646 0
1009 0
61 1
444 0
10 1
704 0
744 0
418 0
998 0
740 0
465 0
97 1
1550 1
1738 0
978 0
690 0
1071 0
1228 1
1539 0
145 1
1015 0
1371 0
1758 0
315 0
71 1
1090 0
1766 0
33 1
Name: Type of Formation_shaly limestone, Length: 515, dtype: uint8
Answer 1:
The goal here turned out to be to create an ensemble of classifiers and, for each instance, take the most "confident" prediction (the class with the highest probability) across all classifiers. The code is below:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import numpy as np
from sklearn.datasets import make_classification

X_train, y_train = make_classification(n_features=4)  # Put your training data here instead

# Parameters for random forest
rfclf_params = {
    'bootstrap': True,
    'class_weight': None,
    'criterion': 'entropy',
    'max_depth': None,
    # 'sqrt' matches the older 'auto' setting, which recent scikit-learn releases no longer accept
    'max_features': 'sqrt',
    # ... fill in the rest you want here
}
# Fill in svm params here (probability=True is required for predict_proba)
svm_params = {
    'probability': True
}
# KNeighbors params go here
kneighbors_params = {
}
params = [rfclf_params, svm_params, kneighbors_params]
classifiers = [RandomForestClassifier, SVC, KNeighborsClassifier]

def ensemble(classifiers, params, X_train, y_train, X_test):
    classes = np.unique(y_train)
    # One column per class, so this is not limited to binary problems
    best_preds = np.zeros((len(X_test), len(classes)))
    for clf_class, clf_params in zip(classifiers, params):
        # Construct the classifier by unpacking its params
        clf = clf_class(**clf_params)
        # Fit the classifier as usual and call predict_proba
        clf.fit(X_train, y_train)
        y_preds = clf.predict_proba(X_test)
        # Keep, for every instance in X_test and every class, the largest
        # probability seen across the classifiers so far (element-wise maximum)
        # see the docs of np.maximum here:
        # https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.maximum.html
        best_preds = np.maximum(best_preds, y_preds)
    # Map the maximum probability for each instance back to its corresponding class
    preds = classes[np.argmax(best_preds, axis=1)]
    return preds

# Test your predictions (evaluated on the training data here; pass your test set as X_test instead)
from sklearn.metrics import accuracy_score, f1_score
y_preds = ensemble(classifiers, params, X_train, y_train, X_train)
print(accuracy_score(y_train, y_preds), f1_score(y_train, y_preds))
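To make the np.maximum step concrete, here is a tiny worked example. The two probability arrays are made-up values standing in for the predict_proba output of two classifiers on three instances; the names proba_a and proba_b are only illustrative.

```python
import numpy as np

classes = np.array([0, 1])

# Hypothetical predict_proba outputs (rows: instances, columns: classes 0 and 1)
proba_a = np.array([[0.55, 0.45], [0.90, 0.10], [0.30, 0.70]])
proba_b = np.array([[0.20, 0.80], [0.60, 0.40], [0.45, 0.55]])

# Element-wise maximum: the largest probability any classifier gave to each class
best = np.maximum(proba_a, proba_b)
# best is [[0.55, 0.80], [0.90, 0.40], [0.45, 0.70]]

# The class with the largest surviving probability wins for each instance
preds = classes[np.argmax(best, axis=1)]
print(preds)  # [1 0 1]
```

For the first instance, the second classifier's 0.80 for class 1 beats the first classifier's 0.55 for class 0, so the more confident classifier decides the label.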
If you want the algorithm to return the highest probabilities instead of the predicted classes, have ensemble return [np.amax(pred_probs) for pred_probs in best_preds] rather than preds.
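Note that the ensemble above always takes the most confident classifier, whereas the question describes an ordered cascade with a 60% threshold. A minimal sketch of that cascade might look like the following. The function name cascade_predict, its threshold argument, and the toy data are assumptions for illustration, not something from the original post. It expects classifiers that are already fitted on the same labels and given in the order you want to try them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def cascade_predict(fitted_clfs, X, threshold=0.6):
    """Hypothetical helper: accept the first classifier (in order) whose top
    probability reaches `threshold`; if none does, fall back to the most
    confident classifier for that instance."""
    X = np.asarray(X)
    n = X.shape[0]
    preds = np.empty(n, dtype=fitted_clfs[0].classes_.dtype)
    best_conf = np.zeros(n)
    undecided = np.ones(n, dtype=bool)
    for clf in fitted_clfs:
        proba = clf.predict_proba(X)
        conf = proba.max(axis=1)
        label = clf.classes_[proba.argmax(axis=1)]
        # For still-undecided instances, keep the most confident label so far
        better = undecided & (conf > best_conf)
        preds[better] = label[better]
        best_conf[better] = conf[better]
        # Instances that reached the threshold are final and skip later classifiers
        undecided &= conf < threshold
    return preds, best_conf


# Toy data standing in for X_train / y_train_sl
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clfs = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    KNeighborsClassifier(n_neighbors=3, weights="distance"),
    SVC(probability=True, random_state=0),
]
for clf in clfs:
    clf.fit(X, y)

preds, conf = cascade_predict(clfs, X)
```

In practice you would fit on the training set and call cascade_predict on X_test_prepared, then score preds against y_test_sl with f1_score. Keep in mind that predict_proba values from different model types are not necessarily calibrated against each other, so a 60% from one classifier may not mean the same thing as a 60% from another.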
Source: https://stackoverflow.com/questions/49396961/improving-the-prediction-score-by-use-of-confidence-level-of-classifiers-on-inst