Question
I am trying to understand the following situation: I am using the iris data and doing cross-validation with a k-nearest neighbors classifier to choose the best k.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn import grid_search
from sklearn.cross_validation import train_test_split

iris = load_iris()
X, Y = iris.data, iris.target

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=42)
parameters = {'n_neighbors': range(1, 21)}
knn = KNeighborsClassifier()
clf = grid_search.GridSearchCV(knn, parameters, cv=10)
clf.fit(X_train, Y_train)
The clf object has the results.
print clf.grid_scores_
[mean: 0.94000, std: 0.08483, params: {'n_neighbors': 1},
 mean: 0.93000, std: 0.08251, params: {'n_neighbors': 2},
 mean: 0.94000, std: 0.08456, params: {'n_neighbors': 3},
 mean: 0.95000, std: 0.08101, params: {'n_neighbors': 4},
 mean: 0.95000, std: 0.08562, params: {'n_neighbors': 5},
 mean: 0.93000, std: 0.08284, params: {'n_neighbors': 6},
 mean: 0.95000, std: 0.08512, params: {'n_neighbors': 7},
 mean: 0.94000, std: 0.08414, params: {'n_neighbors': 8},
 mean: 0.94000, std: 0.08414, params: {'n_neighbors': 9},
 mean: 0.94000, std: 0.08414, params: {'n_neighbors': 10},
 mean: 0.94000, std: 0.08483, params: {'n_neighbors': 11},
 mean: 0.93000, std: 0.08284, params: {'n_neighbors': 12},
 mean: 0.93000, std: 0.08284, params: {'n_neighbors': 13},
 mean: 0.94000, std: 0.08414, params: {'n_neighbors': 14},
 mean: 0.94000, std: 0.08483, params: {'n_neighbors': 15},
 mean: 0.93000, std: 0.08284, params: {'n_neighbors': 16},
 mean: 0.94000, std: 0.08483, params: {'n_neighbors': 17},
 mean: 0.93000, std: 0.09458, params: {'n_neighbors': 18},
 mean: 0.94000, std: 0.08483, params: {'n_neighbors': 19},
 mean: 0.93000, std: 0.10887, params: {'n_neighbors': 20}]
However, when I look at the 10 individual CV scores for the first case, k=1:
print clf.grid_scores_[0].cv_validation_scores
I get
array([ 1. , 0.90909091, 1. , 0.72727273, 0.9 ,
1. , 1. , 1. , 1. , 0.88888889])
However, the mean of these 10 scores,
print clf.grid_scores_[0].cv_validation_scores.mean()
is 0.942525252525, not the 0.94000 reported on the object.
So I am confused about how the mean value is computed and why the two numbers differ. I read the documentation and did not find anything that explains this. What am I missing?
Answer 1:
One of the parameters of GridSearchCV is iid. It defaults to True, and its description reads:
If True, the data is assumed to be identically distributed across the folds, and the loss minimized is the total loss per sample, and not the mean loss across the folds.
Essentially, the grid_scores_ attribute by default reports the mean score over all samples rather than the mean score across the folds. If the folds do not all contain the same number of data points (e.g. because the number of training samples is not divisible by 10 in 10-fold cross-validation, or because stratification makes the folds slightly unequal), then these two numbers won't match.
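You can reproduce the difference with a weighted average. Below, the scores are the ones printed in the question; the fold sizes are an assumption (not given in the question), chosen to be consistent with the score denominators visible in the output (10/11, 8/11, 9/10, 8/9) and a training set of 100 samples:

```python
import numpy as np

# Per-fold scores for n_neighbors=1, as printed in the question.
scores = np.array([1., 0.90909091, 1., 0.72727273, 0.9,
                   1., 1., 1., 1., 0.88888889])

# Hypothetical fold sizes: stratified 10-fold CV on 100 training samples
# with 3 classes yields slightly unequal folds. These sizes are assumed,
# consistent with the fractions above (e.g. 0.909... = 10/11, 0.727... = 8/11).
sizes = np.array([10, 11, 10, 11, 10, 10, 10, 10, 9, 9])

fold_mean = scores.mean()                        # unweighted mean across folds
sample_mean = np.average(scores, weights=sizes)  # mean per sample (iid=True)

print(fold_mean)    # ~0.9425, what the question computed by hand
print(sample_mean)  # ~0.94, what grid_scores_ reports
```

Weighting each fold's score by its number of samples recovers the reported 0.94, while the plain mean across folds gives the 0.9425 computed in the question.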
Source: https://stackoverflow.com/questions/27303813/unexpected-average-of-gridsearchcv-results