Question:
I am following Andrew Ng's instructions for evaluating a classification algorithm:
- Compute the loss function on the training set.
- Compare it with the loss function on the cross-validation set.
- If both are close enough and small, go to the next step (otherwise, there is a bias or variance problem, etc.).
- As a final confirmation, make predictions on the test set using the resulting Thetas (i.e. weights) produced in the previous step.
I am trying to apply this using the scikit-learn library; however, I am really lost and fairly sure I am doing it wrong (I couldn't find anything similar online):
    from sklearn import model_selection, svm
    from sklearn.metrics import make_scorer, log_loss
    from sklearn import datasets

    def main():
        iris = datasets.load_iris()
        kfold = model_selection.KFold(n_splits=10, random_state=42)
        model = svm.SVC(kernel='linear', C=1)
        results = model_selection.cross_val_score(estimator=model, X=iris.data,
                                                  y=iris.target, cv=kfold,
                                                  scoring=make_scorer(log_loss, greater_is_better=False))
        print(results)
Error:

    ValueError: y_true contains only one label (0). Please provide the true labels explicitly through the labels argument.
I am not even sure this is the right way to start. Any help is very much appreciated.
Answer 1:
Given the clarifications you provided in the comments, and since you are not particularly interested in the log loss itself, I think the most straightforward approach is to abandon log loss and go for accuracy instead:
    from sklearn import model_selection, svm
    from sklearn import datasets

    iris = datasets.load_iris()
    # note: newer scikit-learn versions reject random_state here unless shuffle=True
    kfold = model_selection.KFold(n_splits=10, random_state=42)
    model = svm.SVC(kernel='linear', C=1)
    results = model_selection.cross_val_score(estimator=model, X=iris.data,
                                              y=iris.target, cv=kfold,
                                              scoring="accuracy")  # change
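Here, results holds one accuracy score per fold; summarizing it needs nothing beyond the standard NumPy array methods:

    print(results.mean(), results.std())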
As already mentioned in the comments, including log loss in such situations still suffers from some unresolved issues in scikit-learn (see here and here).
For the purpose of estimating the generalization ability of your model, you will be fine with the accuracy metric.
Answer 2:
This kind of error often appears when you do cross-validation.
Basically, your data is split into n_splits = 10 folds, and some classes are missing from some of these splits. For example, your 9th split may not have any examples of class number 2.
Then, when you evaluate the loss, the sets of classes seen in your predictions and in the test fold do not match: you cannot compute the loss if y_true contains 3 classes but your model was trained to predict only 2.
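You can see this directly on the iris data, whose labels are stored in sorted order. A quick check (a sketch, using nothing beyond the calls already in the question) prints the classes that end up in each unshuffled test fold:

    import numpy as np
    from sklearn import datasets, model_selection

    iris = datasets.load_iris()
    kfold = model_selection.KFold(n_splits=10)  # no shuffling

    # iris.target is sorted (50 zeros, then 50 ones, then 50 twos), so every
    # unshuffled test fold of 15 samples falls entirely inside a single class
    for i, (train_idx, test_idx) in enumerate(kfold.split(iris.data)):
        print(i, np.unique(iris.target[test_idx]))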
What do you do in this case?
You have a few possibilities:
- Shuffle your data:

      KFold(n_splits=10, random_state=42, shuffle=True)

- Make n_splits bigger.
- Provide the list of labels explicitly to the loss function (see the combined sketch after this list):

      args_loss = {"labels": [0, 1, 2]}
      make_scorer(log_loss, greater_is_better=False, **args_loss)

- Cherry-pick your splits so you make sure this doesn't happen. I don't think KFold allows this, but GridSearchCV does.
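Putting the first and third options together, a sketch might look like this (note that log_loss scores probabilities, so the SVC is built with probability=True here; also, recent scikit-learn releases replace the older needs_proba=True flag used below with response_method="predict_proba" in make_scorer):

    from sklearn import datasets, model_selection, svm
    from sklearn.metrics import make_scorer, log_loss

    iris = datasets.load_iris()

    # option 1: shuffle, so every fold contains all three classes
    kfold = model_selection.KFold(n_splits=10, random_state=42, shuffle=True)

    # log_loss needs class probabilities, so enable them on the SVC
    model = svm.SVC(kernel='linear', C=1, probability=True)

    # option 3: pass the full label list through to log_loss, and ask the
    # scorer to feed predict_proba output instead of hard predictions
    scorer = make_scorer(log_loss, greater_is_better=False,
                         needs_proba=True, labels=[0, 1, 2])

    results = model_selection.cross_val_score(estimator=model, X=iris.data,
                                              y=iris.target, cv=kfold,
                                              scoring=scorer)
    print(results)  # one negated log loss per fold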
Answer 3:
Just for future readers who are following Andrew's Course:
K-Fold is not practically applicable for this purpose, because what we mainly want is to evaluate the Thetas (i.e. weights) produced by a given algorithm with some parameters on the cross-validation set, comparing the two cost functions J(train) vs. J(CV) to determine whether the model suffers from high bias, high variance, or is O.K.
K-Fold, by contrast, is mainly for testing the predictions on the cross-validation folds using the weights produced by training the model on the training set.
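To mirror the course's procedure in scikit-learn, a plain hold-out split is closer than K-Fold. A minimal sketch (the 60/20/20 split, the stratify arguments, and the choice of log_loss as the cost function J are all assumptions for illustration, not part of the original answer):

    from sklearn import datasets, model_selection, svm
    from sklearn.metrics import log_loss

    iris = datasets.load_iris()

    # split into 60% train, 20% cross-validation, 20% test
    X_train, X_rest, y_train, y_rest = model_selection.train_test_split(
        iris.data, iris.target, test_size=0.4, random_state=42, stratify=iris.target)
    X_cv, X_test, y_cv, y_test = model_selection.train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest)

    # fit once on the training set; the learned parameters play the role of the Thetas
    model = svm.SVC(kernel='linear', C=1, probability=True).fit(X_train, y_train)

    # J(train) vs. J(CV): both close and small suggests neither high bias nor high variance
    j_train = log_loss(y_train, model.predict_proba(X_train), labels=[0, 1, 2])
    j_cv = log_loss(y_cv, model.predict_proba(X_cv), labels=[0, 1, 2])
    print(j_train, j_cv)

    # only once that comparison looks healthy, confirm on the test set
    print(model.score(X_test, y_test))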