How can GridSearchCV be used for clustering (MeanShift or DBSCAN)?

礼貌的吻别 2021-01-18 08:05
I'm trying to cluster some text documents using scikit-learn. I'm trying out both DBSCAN and MeanShift and want to determine which hyperparameters (e.g.

      
      
        
2 answers

天命终不由人 (OP) 2021-01-18 08:45
              

            
            
                        
Have you considered implementing the search yourself?

It's not particularly hard to write a for loop over the candidate values. Even if you want to optimize two parameters, it's still fairly easy.
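As a rough sketch of such a hand-written search, the loop below tries every (eps, min_samples) pair for DBSCAN. The scoring function is my assumption: it uses the silhouette score on the non-noise points, which needs no labels, but any criterion you trust could be swapped in. The toy data from `make_blobs`, the helper name `search_dbscan`, and the candidate values are all made up for illustration.

```python
from itertools import product

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data standing in for document vectors (an assumption for illustration).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)


def search_dbscan(X, eps_values, min_samples_values):
    """Try every (eps, min_samples) pair and return the best-scoring one."""
    best_score, best_params = -1.0, None
    for eps, min_samples in product(eps_values, min_samples_values):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        mask = labels != -1  # DBSCAN marks noise points with -1
        n_clusters = len(set(labels[mask]))
        # Silhouette needs at least 2 clusters and more points than clusters.
        if n_clusters < 2 or mask.sum() <= n_clusters:
            continue
        score = silhouette_score(X[mask], labels[mask])
        if score > best_score:
            best_score = score
            best_params = {"eps": eps, "min_samples": min_samples}
    return best_params, best_score


best_params, best_score = search_dbscan(X, [0.2, 0.3, 0.5, 0.8], [3, 5, 10])
print(best_params, best_score)
```

Note that scoring only the non-noise points can favor settings that throw most of the data away as noise, so it is worth checking how many points end up labeled -1 before trusting the winner.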

For both DBSCAN and MeanShift I do, however, advise first understanding your similarity measure. It makes more sense to choose the parameters based on an understanding of your measure than to optimize them to match some labels (which carries a high risk of overfitting).

In other words, at which distance are two articles supposed to be clustered?

If this distance varies too much from one data point to another, these algorithms will fail badly, and you may need to find a normalized distance function such that the actual similarity values are meaningful again. TF-IDF is standard on text, but mostly in a retrieval context; it may work much worse in a clustering context.
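One way to get a feel for the distance scale is to look at how far each document is from its k-th nearest neighbor. The sketch below does that with TF-IDF vectors and cosine distance. The tiny corpus, the choice of cosine distance, and k = 2 are all assumptions for illustration, not a recommendation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical mini corpus; replace with your own documents.
docs = [
    "the cat sat on the mat",
    "a cat lay on the rug",
    "dogs chase cats in the yard",
    "stock markets fell sharply today",
    "markets rallied as stocks rose",
    "investors sold stocks and bonds",
]

X = TfidfVectorizer().fit_transform(docs)

k = 2  # which neighbor to look at (an arbitrary choice here)
nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(X)

# Calling kneighbors() without arguments queries the indexed points
# themselves and does not count a point as its own neighbor.
distances, _ = nn.kneighbors()

# Sorted distance to the k-th nearest neighbor, one value per document.
kth = np.sort(distances[:, -1])
print(np.round(kth, 3))
```

If the sorted values are spread evenly with no clear jump, that is a hint that no single distance threshold (such as DBSCAN's eps) separates "similar" from "dissimilar" documents well under this measure.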

Also beware that MeanShift (like k-means) needs to recompute coordinates. On text data this may yield undesired results, where the updated coordinates actually get worse instead of better.
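One possible illustration of this, under my own reading of the problem: averaging sparse TF-IDF vectors gives a point that no longer looks like a document. The sketch below only computes a plain mean of a hypothetical mini corpus (it does not run MeanShift itself) and compares it with the individual document vectors.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus; replace with your own documents.
docs = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "stock markets fell sharply today",
    "investors sold stocks and bonds",
]

# Rows are L2-normalized by default, so every document has norm 1.
X = TfidfVectorizer().fit_transform(docs)

# Number of non-zero terms in each document (CSR row lengths).
nnz_per_doc = np.diff(X.indptr)

# Plain average of all document vectors, as a dense 1-D array.
mean_vec = np.asarray(X.mean(axis=0)).ravel()

print("max non-zero terms in a document:", nnz_per_doc.max())
print("non-zero terms in the mean:", np.count_nonzero(mean_vec))
print("norm of the mean:", np.linalg.norm(mean_vec))
```

The mean has a non-zero weight for every term that appears anywhere, and its norm drops below 1, so it is denser and shorter than any real document vector. Whether that hurts in practice depends on the data and the distance used.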
    
             
                                                        
            
            
              
                