Difference between cross_val_score and cross_val_predict

青春惊慌失措 2020-12-13 14:14

I want to evaluate a regression model built with scikit-learn using cross-validation, and I am confused about which of the two functions cross_val_score and c

3 answers
  •  眼角桃花
    2020-12-13 14:50

    I think the difference can be made clear by inspecting their outputs. Consider this snippet:

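    from sklearn.model_selection import cross_val_predict, cross_val_score
    from sklearn.neural_network import MLPClassifier

    # X is a NumPy array loaded beforehand; its last column is the label
    print(X.shape)  # (7040, 133)

    clf = MLPClassifier()

    # One score per fold
    scores = cross_val_score(clf, X[:, :-1], X[:, -1], cv=5)
    print(scores.shape)  # (5,)

    # One prediction per sample
    y_pred = cross_val_predict(clf, X[:, :-1], X[:, -1], cv=5)
    print(y_pred.shape)  # (7040,)
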

    Notice the shapes. scores has length 5 because cross_val_score computes one score per fold, and cv=5 requests 5 folds. Each entry is therefore a single real value, the score of the classifier on that fold:

    given the true labels and the predicted labels, what fraction of the predictions were correct in a particular fold?

    In this case, the y labels passed as input are used twice: the labels of the training folds to fit the classifier, and the labels of the held-out fold to evaluate its performance.
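
    To make the per-fold computation concrete, here is a minimal sketch that rebuilds the output of cross_val_score by hand. The synthetic data from make_classification and the LogisticRegression model are assumptions, chosen so the example runs quickly and deterministically; they stand in for the X and MLPClassifier of the snippet above. The sketch also relies on scikit-learn's documented default that an integer cv combined with a classifier means an unshuffled StratifiedKFold.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption, not the dataset from the answer)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000)

# One score per fold, computed by scikit-learn
scores = cross_val_score(clf, X, y, cv=5)

# The same thing by hand: fit on the training part, score on the held-out part
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    model = clone(clf).fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

print(scores.shape)  # (5,)
print(np.allclose(scores, manual_scores))
```

    Each entry of scores is the accuracy on one held-out fold, which is why the labels appear in both the fit call and the score call.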

    On the other hand, y_pred has length 7040, the number of samples in the input dataset. Each value is therefore not a score aggregated over many samples but a single prediction of the classifier:

    given the input data and their labels, what does the classifier predict for a specific example that was in the test set of a particular fold?

    Note that you do not know what fold was used: each output was computed on the test data of a certain fold, but you can't tell which (from this output, at least).
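
    If you do need to know which fold produced each prediction, one possible approach is to pass an explicit splitter and replay its split afterwards. This is only a sketch: the synthetic data, the LogisticRegression model, and the fold_of array are illustrative assumptions, and it works because an unshuffled StratifiedKFold yields the same splits every time split is called.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in data (an assumption, not the dataset from the answer)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000)

cv = StratifiedKFold(n_splits=5)  # no shuffling, so split() is repeatable
y_pred = cross_val_predict(clf, X, y, cv=cv)

# fold_of[i] is the index of the fold whose test set contained sample i
fold_of = np.full(len(y), -1)
for fold, (_, test_idx) in enumerate(cv.split(X, y)):
    fold_of[test_idx] = fold

print(y_pred.shape)          # (200,)
print(np.bincount(fold_of))  # number of test samples per fold
```

    Every sample lands in exactly one test fold, so fold_of has no -1 entries left after the loop.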

    In this case, the labels are used just once: to train the classifier. It's your job to compare these predictions to the true labels to compute a score. If you just average them, as you did, the result is not a score; it is only the average prediction.
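
    A minimal sketch of that last step, assuming a classification task scored with accuracy. The synthetic data and the LogisticRegression model are stand-ins chosen for speed, and the variable names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, cross_val_score

# Synthetic stand-in data (an assumption, not the dataset from the answer)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000)

y_pred = cross_val_predict(clf, X, y, cv=5)

# A score: out-of-fold predictions compared with the true labels
pooled_accuracy = accuracy_score(y, y_pred)

# Not a score: just the average predicted label
mean_prediction = y_pred.mean()

# For comparison: the mean of the per-fold scores
mean_fold_score = cross_val_score(clf, X, y, cv=5).mean()

print(pooled_accuracy, mean_fold_score, mean_prediction)
```

    The pooled accuracy and the mean of the per-fold scores are usually close, but they need not be identical when the folds differ in size. For a regression model, the same pattern applies with a regression metric such as r2_score or mean_squared_error in place of accuracy_score.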
