I want to evaluate a regression model built with scikit-learn using cross-validation, and I'm getting confused about which of the two functions, cross_val_score and cross_val_predict, I should use.
I think the difference can be made clear by inspecting their outputs. Consider this snippet:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict

# X is assumed to be loaded already; the last column is the label
print(X.shape)  # (7040, 133)
clf = MLPClassifier()
scores = cross_val_score(clf, X[:, :-1], X[:, -1], cv=5)
print(scores.shape)  # (5,)
y_pred = cross_val_predict(clf, X[:, :-1], X[:, -1], cv=5)
print(y_pred.shape)  # (7040,)
Notice the shapes: why are they like this?
scores.shape has length 5 because a score is computed with cross-validation over 5 folds (see the argument cv=5). A single real value is therefore computed for each fold. That value is the score of the classifier:
given the true labels and the predicted labels, how many predictions did the classifier get right in that particular fold?
In this case, the y labels given as input are used twice: to learn from the data and to evaluate the performance of the classifier.
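As a minimal, runnable sketch of that behavior (using a synthetic dataset and a simple classifier instead of the X and MLPClassifier above, which are assumptions made here for illustration), you get one score per fold and can summarize them with a mean:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the dataset in the question
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)

print(scores.shape)   # (5,) -- one accuracy value per fold
print(scores.mean())  # the usual cross-validated estimate
```

Averaging the five fold scores is the standard way to report a single cross-validated number.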
On the other hand, y_pred.shape has length 7040, which is the number of examples in the input dataset. Each value is not a score aggregated over multiple examples, but a single value: the prediction of the classifier:
given the input data, what does the classifier predict for a specific example that was in the test set of a particular fold?
Note that you do not know which fold was used: each output was computed on the test data of a certain fold, but you can't tell which (from this output, at least).
In this case, the labels are used just once: to train the classifier. It's your job to compare these predictions to the true labels to compute a score. If you just average them, as you did, the output is not a score; it's just the average prediction.
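A hedged sketch of that comparison (again on a synthetic dataset, an assumption for illustration): you pass the out-of-fold predictions to a metric such as accuracy_score yourself, whereas averaging the predictions gives something else entirely:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

clf = LogisticRegression(max_iter=1000)
y_pred = cross_val_predict(clf, X, y, cv=5)

print(y_pred.shape)               # (100,) -- one prediction per example
print(accuracy_score(y, y_pred))  # a score, computed by you from the predictions
print(y_pred.mean())              # NOT a score: just the average predicted label
```

Note that the accuracy computed this way pools all out-of-fold predictions into one number, so it need not exactly equal the mean of the per-fold scores from cross_val_score.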