Unexpected average of GridSearchCV results

Posted by 左心房为你撑大大i on 2019-12-14 02:14:00

Question


I am trying to understand the following behaviour. I am using the iris data and running cross-validation with a k-nearest neighbors classifier to choose the best k.

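# Legacy scikit-learn API (sklearn.grid_search and sklearn.cross_validation
# were later replaced by sklearn.model_selection).
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn import grid_search
from sklearn.cross_validation import train_test_split

# Load the iris data (X and Y were not defined in the original snippet).
iris = load_iris()
X, Y = iris.data, iris.target

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=42)

# Search k = 1..20 with 10-fold cross-validation.
parameters = {'n_neighbors': range(1, 21)}
knn = KNeighborsClassifier()
clf = grid_search.GridSearchCV(knn, parameters, cv=10)
clf.fit(X_train, Y_train)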

The clf object has the results.

print clf.grid_scores_

[mean: 0.94000, std: 0.08483, params: {'n_neighbors': 1}, mean: 0.93000, std: 0.08251, params: {'n_neighbors': 2}, mean: 0.94000, std: 0.08456, params: {'n_neighbors': 3}, mean: 0.95000, std: 0.08101, params: {'n_neighbors': 4}, mean: 0.95000, std: 0.08562, params: {'n_neighbors': 5}, mean: 0.93000, std: 0.08284, params: {'n_neighbors': 6}, mean: 0.95000, std: 0.08512, params: {'n_neighbors': 7}, mean: 0.94000, std: 0.08414, params: {'n_neighbors': 8}, mean: 0.94000, std: 0.08414, params: {'n_neighbors': 9}, mean: 0.94000, std: 0.08414, params: {'n_neighbors': 10}, mean: 0.94000, std: 0.08483, params: {'n_neighbors': 11}, mean: 0.93000, std: 0.08284, params: {'n_neighbors': 12}, mean: 0.93000, std: 0.08284, params: {'n_neighbors': 13}, mean: 0.94000, std: 0.08414, params: {'n_neighbors': 14}, mean: 0.94000, std: 0.08483, params: {'n_neighbors': 15}, mean: 0.93000, std: 0.08284, params: {'n_neighbors': 16}, mean: 0.94000, std: 0.08483, params: {'n_neighbors': 17}, mean: 0.93000, std: 0.09458, params: {'n_neighbors': 18}, mean: 0.94000, std: 0.08483, params: {'n_neighbors': 19}, mean: 0.93000, std: 0.10887, params: {'n_neighbors': 20}]

However, when I look at the 10 CV scores for the first case, k=1,

print clf.grid_scores_[0].cv_validation_scores

I get

array([ 1.        ,  0.90909091,  1.        ,  0.72727273,  0.9       ,
        1.        ,  1.        ,  1.        ,  1.        ,  0.88888889])

However, the mean of these 10 observations

print clf.grid_scores_[0].cv_validation_scores.mean()

is 0.942525252525, not the 0.940000 presented on the object.

So I am confused about how the reported mean is computed and why it differs from the mean of the fold scores. I read the documentation and did not find anything that explains it. What am I missing?


Answer 1:


One of the parameters of GridSearchCV is "iid". It takes a default value of True, and the description reads:

If True, the data is assumed to be identically distributed across the folds, and the loss minimized is the total loss per sample, and not the mean loss across the folds.

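Essentially, with iid=True (the default) the mean reported in the grid_scores_ attribute is the score over all samples, i.e. each fold's score weighted by the number of samples in that fold's test set, rather than the plain mean of the per-fold scores. If the folds do not all contain the same number of data points (for example, when the number of samples is not divisible by 10 in 10-fold cross-validation, or when stratification splits the classes unevenly), these two numbers won't match. That is the case here: the fold scores 0.90909091 (10/11), 0.9 (9/10) and 0.88888889 (8/9) are consistent with test folds of 11, 10 and 9 samples.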
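To make the difference concrete, here is a minimal sketch in plain Python that recomputes both averages from the k=1 fold scores in the question. The fold sizes are an assumption, not something shown in the original output: sizes of 11, 11, 10 and 9 for the imperfect folds are inferred from the fractions 10/11, 8/11, 9/10 and 8/9, and the remaining sizes are chosen so that the total is 100 training samples.

```python
# Per-fold accuracies for k=1, as printed in the question.
scores = [1.0, 10.0 / 11, 1.0, 8.0 / 11, 0.9, 1.0, 1.0, 1.0, 1.0, 8.0 / 9]

# ASSUMED test-fold sizes (hypothetical): inferred from the score fractions
# where possible, otherwise picked so that the sizes sum to 100.
fold_sizes = [10, 11, 10, 11, 10, 10, 10, 10, 9, 9]

# Plain mean across folds: what .cv_validation_scores.mean() computes.
unweighted_mean = sum(scores) / len(scores)

# Mean across samples: each fold's score weighted by its test-fold size,
# which is what iid=True reports.
weighted_mean = (sum(s * n for s, n in zip(scores, fold_sizes))
                 / float(sum(fold_sizes)))

print(round(unweighted_mean, 6))  # 0.942525
print(round(weighted_mean, 6))    # 0.94
```

Under these assumptions the weighted mean is 94 correct predictions out of 100 samples, which matches the 0.94000 shown in grid_scores_. Because the perfectly classified folds contribute exactly their own size to the weighted sum, the result does not depend on how the assumed sizes are distributed among them.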



Source: https://stackoverflow.com/questions/27303813/unexpected-average-of-gridsearchcv-results
