As part of the Enron project, I built the attached model. Below is a summary of the steps:
    cv = StratifiedS
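Since the snippet above is truncated, here is a minimal, self-contained sketch of what such a setup could look like. The pipeline, parameter grid, dataset, and the choice of StratifiedShuffleSplit are all assumptions for illustration, not the original poster's actual code:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the Enron features/labels
features, labels = make_classification(n_samples=200, n_features=10, random_state=42)

# Hypothetical pipeline and parameter grid (the original post's are not shown)
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", DecisionTreeClassifier(random_state=42))])
clf_params = {"clf__min_samples_split": [2, 10, 20]}

# A stratified splitter, as the truncated `cv = StratifiedS` line suggests
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(features, labels)
print(gcv.best_params_)
```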
Basically the grid search will:

- try every combination of your parameter grid;
- for each combination, run a cross-validation using the splitter you pass in;
- keep the combination with the best average score.
So your second case is the good one. Otherwise you are actually predicting on data that you trained with (which is not the case in the second option; there you only keep the best parameters from your grid search).
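The gap between the two cases is easy to demonstrate with a toy dataset and classifier (both made up here, not from the original post): scoring a model on the data it was fitted on is optimistic compared to scoring it on a held-out split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

# First case: fit and score on the same data -> the tree memorizes it
clf.fit(X, y)
train_score = clf.score(X, y)

# Second case: score on data the model never saw during fitting
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf.fit(X_tr, y_tr)
test_score = clf.score(X_te, y_te)

print(train_score, test_score)
```

The unrestricted tree reaches a perfect training score, while the held-out score is the honest estimate of generalization.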
GridSearchCV, as @Gauthier Feuillen said, is used to search for the best parameters of an estimator on the given data. Description of GridSearchCV:
    gcv = GridSearchCV(pipe, clf_params, cv=cv)
    gcv.fit(features, labels)

1. clf_params will be expanded into all possible separate combinations using ParameterGrid.
2. features will be split into features_train and features_test using cv. Same for labels.
3. The estimator (pipe) will be fitted on features_train and labels_train and scored on features_test and labels_test.
4. Steps 2 and 3 are repeated for each cv iteration. The average score across the cv iterations is assigned to that parameter combination; it can be accessed via the cv_results_ attribute of the GridSearchCV object.
5. For the best-scoring combination, the estimator is refitted on the complete data (features, labels).

Because of the last step, you are getting different scores in the first and second approach: in the first approach all the data was used for training and you are predicting on that same data, while the second approach predicts on previously unseen data.
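The mechanics above can be checked by hand. In this sketch (dataset, grid, and splitter are toy stand-ins, not the original model), the value GridSearchCV stores in cv_results_ equals a manually computed cross-validation average for the same parameters, while scoring the refitted estimator on its own training data gives the optimistic first-approach number:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

features, labels = make_classification(n_samples=300, random_state=1)
cv = StratifiedKFold(n_splits=5)
clf_params = {"min_samples_split": [2, 20]}

gcv = GridSearchCV(DecisionTreeClassifier(random_state=1), clf_params, cv=cv)
gcv.fit(features, labels)

# First approach: predict the data the refitted estimator was trained on
first = gcv.score(features, labels)

# Second approach: the averaged cross-validated score of the best combination
second = gcv.cv_results_["mean_test_score"][gcv.best_index_]

# Recompute that average by hand with the same splitter and parameters
manual = cross_val_score(
    DecisionTreeClassifier(random_state=1, **gcv.best_params_),
    features, labels, cv=cv,
).mean()

print(first, second, manual)
```

Because StratifiedKFold without shuffling is deterministic, `second` and `manual` agree exactly; `first` is computed on already-seen data and is not comparable to them.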