Retrieving specific classifiers and data from GridSearchCV

Submitted by 故事扮演 on 2020-06-07 07:22:26

Question


I am running a Python 3 classification script on a server using the following code:

# define knn classifier for transformed data
knn_classifier = neighbors.KNeighborsClassifier()

# define KNN parameters
knn_parameters = [{
    'n_neighbors': [1, 3, 5, 7, 9, 11],
    'leaf_size': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'n_jobs': [-1],
    'weights': ['uniform', 'distance']}]

# Stratified k-fold (default for classifier)
# n = 5 folds is default
knn_models = GridSearchCV(estimator = knn_classifier, param_grid = knn_parameters, scoring = 'accuracy')

# fit grid search models to transformed training data
knn_models.fit(X_train_transformed, y_train)

I then save the GridSearchCV object using pickle:

# save model
with open('knn_models.pickle', 'wb') as f:
    pickle.dump(knn_models, f)

This lets me test the classifiers on smaller datasets on my local machine by running:

knn_models = pickle.load(open("knn_models.pickle", "rb"))
validation_knn_model = knn_models.best_estimator_

This is great if I only want to evaluate the best estimator on a validation set. But what I'd actually like to do is:

  • pull the original data out of the GridSearchCV object (I'm assuming it's stored somewhere in the object, since it is needed to classify the new validation set)
  • try a few specific classifiers with almost all of the best parameters as determined by the grid search, but changing a specific input parameter, e.g. k = 3, 5, 7
  • retrieve y_pred, i.e. the predictions on each validation set, for all of the new classifiers tested above

Answer 1:


As discussed in the comments, GridSearchCV does not include the original data (and it would be arguably absurd if it did). The only data it includes is its own bookkeeping, i.e. the detailed scores & parameters tried for each CV fold. The best_estimator_ returned is the only thing needed to apply the model to any new data encountered, but if, as you say, you would like to dig deeper into the details, the full results are returned in its cv_results_ attribute.
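If you also need the training data on your local machine, the simplest approach is to pickle it separately, alongside the model. A minimal sketch, assuming the same X_train_transformed / y_train names as in your question:

import pickle

# the data is not stored in the GridSearchCV object, so save it separately
with open('knn_data.pickle', 'wb') as f:
    pickle.dump((X_train_transformed, y_train), f)

# later, on the local machine:
with open('knn_data.pickle', 'rb') as f:
    X_train_transformed, y_train = pickle.load(f)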

Adapting the example from the documentation to the knn classifier with your own knn_parameters grid (but removing n_jobs, which only affects the fitting speed and is not a real hyperparameter of the algorithm), and keeping cv=3 for simplicity, we have:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
import pandas as pd

iris = load_iris()
knn_parameters = [{
    'n_neighbors': [1, 3, 5, 7, 9, 11],
    'leaf_size': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'weights': ['uniform', 'distance']}]

knn_classifier = KNeighborsClassifier()
clf = GridSearchCV(estimator=knn_classifier, param_grid=knn_parameters, scoring='accuracy', n_jobs=-1, cv=3)
clf.fit(iris.data, iris.target)

clf.best_estimator_
# result:
KNeighborsClassifier(algorithm='auto', leaf_size=5, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

So, as said, this last result tells you all you need to know to apply the algorithm to any new data (validation, test, data from deployment, etc.). Also, you may find that actually removing the n_jobs entry from the knn_parameters grid and asking instead for n_jobs=-1 in the GridSearchCV object results in a much faster CV procedure. Nevertheless, if you want to use n_jobs=-1 for your final model, you can easily manipulate the best_estimator_ to do so:

clf.best_estimator_.n_jobs = -1
clf.best_estimator_
# result
KNeighborsClassifier(algorithm='auto', leaf_size=5, metric='minkowski',
                     metric_params=None, n_jobs=-1, n_neighbors=5, p=2,
                     weights='uniform')

This actually answers your second question, since you can similarly manipulate the best_estimator_ to change other hyperparameters, too.
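For instance, here is a minimal sketch of your second and third points combined: clone the best estimator, override only n_neighbors, re-fit, and collect the predictions. Since, as said, the data itself is not stored in clf, it must be supplied by you; the X_validation / y_validation names below are assumptions standing in for your own validation set:

from sklearn.base import clone
from sklearn.metrics import accuracy_score

# try the best parameters, varying only k; changing a hyperparameter
# requires re-fitting, so the training data (X_train_transformed,
# y_train, as in the question) must be provided again
for k in [3, 5, 7]:
    model = clone(clf.best_estimator_)        # unfitted copy with the best parameters
    model.set_params(n_neighbors=k)           # override a single hyperparameter
    model.fit(X_train_transformed, y_train)
    y_pred = model.predict(X_validation)      # X_validation: your own validation set (assumed name)
    print(k, accuracy_score(y_validation, y_pred))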

So, having found the best model is where most people would stop. But if, for any reason, you want to dig further into the details of the whole grid search process, the detailed results are returned in the cv_results_ attribute, which you can even import into a pandas dataframe for easier inspection:

cv_results = pd.DataFrame.from_dict(clf.cv_results_)

For example, the cv_results dataframe includes a column rank_test_score which, as its name clearly implies, contains the rank of each parameter combination:

cv_results['rank_test_score']
# result:
0      481
1      481
2      145
3      145
4        1
      ... 
571      1
572    145
573    145
574    433
575      1
Name: rank_test_score, Length: 576, dtype: int32

Here 1 means best, and you can readily see that there is more than one combination ranked as 1 - so in fact here we have more than one "best" model (i.e. parameter combination)! Although here this is most probably due to the relative simplicity of the iris dataset used, there is no reason in principle why it cannot happen in a real case, too. In such cases, the returned best_estimator_ is just the first of these occurrences - here combination number 4:

cv_results.iloc[4]
# result:
mean_fit_time                                              0.000669559
std_fit_time                                               1.55811e-05
mean_score_time                                             0.00474652
std_score_time                                             0.000488042
param_algorithm                                                   auto
param_leaf_size                                                      5
param_n_neighbors                                                    5
param_weights                                                  uniform
params               {'algorithm': 'auto', 'leaf_size': 5, 'n_neigh...
split0_test_score                                                 0.98
split1_test_score                                                 0.98
split2_test_score                                                 0.98
mean_test_score                                                   0.98
std_test_score                                                       0
rank_test_score                                                      1
Name: 4, dtype: object

which, as you can easily see, has the same parameters as our best_estimator_ above. But now you can inspect all the "best" models, simply by:

cv_results.loc[cv_results['rank_test_score']==1]

which, in my case, results in no fewer than 144 models (out of the 6*12*4*2 = 576 combinations tried in total)! So, you can in fact select among more choices, or even use additional criteria, say the standard deviation of the returned score (the lower the better, although here it is at its minimum value of 0), instead of relying simply on the maximum mean score, which is what the automatic procedure returns.
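A minimal sketch of such a tie-breaking rule (mostly illustrative here, since all the std values of the tied models happen to be 0):

best = cv_results.loc[cv_results['rank_test_score'] == 1]
# among the combinations tied for the best mean score, prefer the one
# with the smallest score variability across the CV folds
best.sort_values('std_test_score').iloc[0]['params']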

Hopefully these will be enough to get you started...



Source: https://stackoverflow.com/questions/61687800/retrieving-specific-classifiers-and-data-from-gridsearchcv
