Evaluating the Loss Function Value from the Training Set on the Cross-Validation Set

Anonymous (unverified), submitted 2019-12-03 01:00:01

Question:

I am following Andrew Ng's instructions for evaluating a classification algorithm:

  1. Find the loss function value on the training set.
  2. Compare it with the loss function value on the cross-validation set.
  3. If both are close enough and small, go to the next step (otherwise, there is a bias or variance problem, etc.).
  4. Make a prediction on the test set, using the resulting Thetas (i.e. weights) produced in the previous steps, as a final confirmation.

I am trying to apply this using the scikit-learn library; however, I am really lost and fairly sure that I am doing it wrong (I did not find anything similar online):

    from sklearn import model_selection, svm
    from sklearn.metrics import make_scorer, log_loss
    from sklearn import datasets

    def main():
        iris = datasets.load_iris()
        kfold = model_selection.KFold(n_splits=10, random_state=42)
        model = svm.SVC(kernel='linear', C=1)
        results = model_selection.cross_val_score(estimator=model,
                                                  X=iris.data,
                                                  y=iris.target,
                                                  cv=kfold,
                                                  scoring=make_scorer(log_loss, greater_is_better=False))
        print(results)

    if __name__ == '__main__':
        main()

Error

ValueError: y_true contains only one label (0). Please provide the true labels explicitly through the labels argument. 

I am not even sure this is the right way to start. Any help is very much appreciated.

Answer 1:

Given the clarifications you provided in the comments, and since you are not particularly interested in the log loss itself, I think the most straightforward approach is to abandon the log loss and use accuracy instead:

    from sklearn import model_selection, svm
    from sklearn import datasets

    iris = datasets.load_iris()
    kfold = model_selection.KFold(n_splits=10, random_state=42)
    model = svm.SVC(kernel='linear', C=1)
    results = model_selection.cross_val_score(estimator=model,
                                              X=iris.data,
                                              y=iris.target,
                                              cv=kfold,
                                              scoring="accuracy")  # change

As already mentioned in the comments, the inclusion of log loss in such situations still suffers from some unresolved issues in scikit-learn (see here and here).

For the purpose of estimating the generalization ability of your model, you will be fine with the accuracy metric.
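
If a cross-validated log-loss figure is still wanted, one possible workaround (a sketch, not the approach recommended above) is to use the built-in neg_log_loss scorer together with stratified, shuffled folds and probability estimates enabled on the SVC:

    from sklearn import model_selection, svm
    from sklearn import datasets

    iris = datasets.load_iris()
    # StratifiedKFold keeps all three classes present in every fold,
    # which avoids the "only one label" problem entirely.
    skf = model_selection.StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    # probability=True is required so the built-in 'neg_log_loss' scorer
    # can call predict_proba (it adds a Platt-scaling step to training).
    model = svm.SVC(kernel='linear', C=1, probability=True)
    results = model_selection.cross_val_score(estimator=model,
                                              X=iris.data,
                                              y=iris.target,
                                              cv=skf,
                                              scoring='neg_log_loss')
    print(results)          # negated log losses, one per fold
    print(-results.mean())  # average log loss (lower is better)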



Answer 2:

This kind of error often appears when you do cross-validation.

Basically, your data is split into n_splits = 10 folds, and some classes are missing from some of these splits. For example, your 9th split may not have any training examples for class number 2.

So when you evaluate your loss, the number of classes present in your predictions and in the test fold do not match: you cannot compute the loss if y_true contains 3 classes but your model was trained to predict only 2.
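
You can check this on the setup from the question: the iris targets are sorted by class, so with n_splits=10 and no shuffling most test folds contain a single class (the first fold contains only class 0, which is exactly what the error message reports). A quick check, using the same data:

    import numpy as np
    from sklearn import model_selection, datasets

    iris = datasets.load_iris()
    kfold = model_selection.KFold(n_splits=10)  # no shuffling, as in the question
    # Show which classes end up in each test fold; with the sorted iris
    # labels, 8 of the 10 folds contain a single class.
    for i, (train_idx, test_idx) in enumerate(kfold.split(iris.data)):
        print(i, np.unique(iris.target[test_idx]))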

What do you do in this case?

You have four possibilities:

  1. Shuffle your data: KFold(n_splits=10, random_state=42, shuffle=True)
  2. Make n_splits bigger
  3. Provide the list of labels explicitly to the loss function, as follows (a sketch combining this with option 1 appears after this list):

    args_loss = {"labels": [0, 1, 2]}
    make_scorer(log_loss, greater_is_better=False, **args_loss)

  4. Cherry-pick your splits so you make sure this doesn't happen. I don't think KFold allows this, but GridSearchCV does.
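
Here is a sketch combining options 1 and 3. Note that log_loss expects class probabilities, so two extra pieces are needed beyond the snippet above: the scorer has to request predict_proba, and the SVC has to enable it.

    from sklearn import model_selection, svm
    from sklearn.metrics import make_scorer, log_loss
    from sklearn import datasets

    iris = datasets.load_iris()

    # Option 1: shuffle, so every fold is very likely to contain all classes.
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=42)

    # Option 3: pass the full label list so a fold that still misses a class
    # can be scored. needs_proba=True makes the scorer feed predict_proba
    # output to log_loss (newer scikit-learn versions spell this
    # response_method="predict_proba").
    scorer = make_scorer(log_loss, greater_is_better=False,
                         needs_proba=True, labels=[0, 1, 2])

    model = svm.SVC(kernel='linear', C=1, probability=True)
    results = model_selection.cross_val_score(estimator=model,
                                              X=iris.data,
                                              y=iris.target,
                                              cv=kfold,
                                              scoring=scorer)
    print(results)  # negated log losses, one per fold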


Answer 3:

Just for future readers who are following Andrew's Course:

K-Fold is not really applicable for this purpose, because what we mainly want is to evaluate the Thetas (i.e. weights) produced by a given algorithm and set of parameters on the cross-validation set, and then compare the two cost functions J(train) vs J(CV) to determine whether the model suffers from bias, from variance, or is fine.

Nevertheless, K-Fold is mainly for testing the predictions on the CV folds using the weights produced from training the model on the training set.
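
A minimal sketch of that comparison in scikit-learn, using a plain hold-out split rather than K-Fold (the 60/20/20 proportions follow the course; the log loss stands in for the course's cost function J):

    from sklearn import svm, datasets
    from sklearn.metrics import log_loss
    from sklearn.model_selection import train_test_split

    iris = datasets.load_iris()

    # 60% train, 20% cross-validation, 20% test, stratified so that every
    # split contains all three classes.
    X_train, X_rest, y_train, y_rest = train_test_split(
        iris.data, iris.target, test_size=0.4, stratify=iris.target, random_state=42)
    X_cv, X_test, y_cv, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)

    # Fit once on the training set; the fitted coefficients play the role
    # of the Thetas (weights) from the course.
    model = svm.SVC(kernel='linear', C=1, probability=True).fit(X_train, y_train)

    # J(train) vs J(CV), both computed with the same fitted model.
    j_train = log_loss(y_train, model.predict_proba(X_train), labels=[0, 1, 2])
    j_cv = log_loss(y_cv, model.predict_proba(X_cv), labels=[0, 1, 2])
    print("J(train) = %.3f, J(CV) = %.3f" % (j_train, j_cv))

    # If J(CV) >> J(train): high variance. If both are large: high bias.
    # If both are small and close: make the final check on the test set.
    print("test accuracy:", model.score(X_test, y_test))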


