I want to evaluate a regression model built with scikit-learn using cross-validation, and I'm getting confused about which of the two functions, cross_val_score and cross_val_predict, I should use.
I think the difference can be made clear by inspecting their outputs. Consider this snippet:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict

# X is assumed to be loaded already; the last column is the label
print(X.shape)  # (7040, 133)
clf = MLPClassifier()
scores = cross_val_score(clf, X[:, :-1], X[:, -1], cv=5)
print(scores.shape)  # (5,)
y_pred = cross_val_predict(clf, X[:, :-1], X[:, -1], cv=5)
print(y_pred.shape)  # (7040,)
Notice the shapes: why are they like this?
scores.shape has length 5 because a score is computed with cross-validation over 5 folds (see the argument cv=5). A single real value is therefore computed for each fold. That value is the score of the classifier:
given the true labels and the predicted labels, how many predictions did the classifier get right in that particular fold?
In this case, the y labels given as input are used twice: to learn from the data and to evaluate the performance of the classifier.
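As a minimal, runnable sketch of that behavior (using a synthetic dataset and a simple classifier instead of the X and MLPClassifier above, which are assumptions made here for illustration), you get one score per fold and can summarize them with a mean:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the dataset in the question
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)

print(scores.shape)   # (5,) -- one accuracy value per fold
print(scores.mean())  # the usual cross-validated estimate
```

Averaging the five fold scores is the standard way to report a single cross-validated number.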
On the other hand, y_pred.shape has length 7040, which is the number of examples in the input dataset. Each value is not a score aggregated over multiple examples, but a single value: the prediction of the classifier:
given the input data, what does the classifier predict for a specific example that was in the test set of a particular fold?
Note that you do not know which fold was used: each output was computed on the test data of a certain fold, but you can't tell which (from this output, at least).
In this case, the labels are used just once: to train the classifier. It's your job to compare these predictions to the true labels to compute a score. If you just average them, as you did, the output is not a score; it's just the average prediction.
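A hedged sketch of that comparison (again on a synthetic dataset, an assumption for illustration): you pass the out-of-fold predictions to a metric such as accuracy_score yourself, whereas averaging the predictions gives something else entirely:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

clf = LogisticRegression(max_iter=1000)
y_pred = cross_val_predict(clf, X, y, cv=5)

print(y_pred.shape)               # (100,) -- one prediction per example
print(accuracy_score(y, y_pred))  # a score, computed by you from the predictions
print(y_pred.mean())              # NOT a score: just the average predicted label
```

Note that the accuracy computed this way pools all out-of-fold predictions into one number, so it need not exactly equal the mean of the per-fold scores from cross_val_score.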