How to remove a word completely from a Word2Vec model in gensim?

夕颜 2020-12-16 13:09

Given a model, e.g.

from gensim.models.word2vec import Word2Vec

documents = ["Human machine interface for lab abc computer applications",
"A survey of u

4 answers

伪装坚强ぢ (original poster) 2020-12-16 13:32

There is no direct way to do what you are looking for, but you are not completely stuck. The method most_similar is implemented in the class WordEmbeddingsKeyedVectors. You can take a look at that method and modify it to suit your needs.

The lines shown below perform the actual computation of the similar words. Replace the variable limited with the vectors corresponding to the words you are interested in, and you are done.

limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]
dists = dot(limited, mean)
if not topn:
    return dists
best = matutils.argsort(dists, topn=topn + len(all_words), reverse=True)


Update:

limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]


In this line, if restrict_vocab is passed, only the first restrict_vocab words in the vocabulary are searched; this is meaningful only if the vocabulary is sorted by frequency. If you do not pass restrict_vocab, the whole of self.vectors_norm goes into limited.
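To make the slicing concrete, here is a minimal sketch on a toy matrix. The names index2word, vectors_norm and select_limited are hypothetical stand-ins for illustration and are not taken from gensim; the rows are assumed to be ordered from most to least frequent word.

```python
import numpy as np

# Hypothetical toy vocabulary, assumed sorted from most to least frequent.
index2word = ["the", "computer", "human", "survey"]

# Toy stand-in for self.vectors_norm: one unit-length row per word.
vectors_norm = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
    [0.8, 0.6],
])


def select_limited(vectors_norm, restrict_vocab=None):
    """Mirror the quoted line: keep every row, or only the first rows."""
    return vectors_norm if restrict_vocab is None else vectors_norm[:restrict_vocab]


limited_all = select_limited(vectors_norm)                     # all four words
limited_top2 = select_limited(vectors_norm, restrict_vocab=2)  # first two words
searched_words = index2word[:2]                                # words still searched
```

Because the slice only takes a prefix of the rows, it cannot select an arbitrary set of words; that is why the answer suggests building limited yourself.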

The method most_similar calls another method, init_sims, which initializes self.vectors_norm as shown below:

self.vectors_norm = (self.vectors / sqrt((self.vectors ** 2).sum(-1))[..., newaxis]).astype(REAL)
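The same row-wise normalisation can be tried on a small array with plain NumPy. This is only a sketch: the toy vectors are made up, and REAL is assumed here to correspond to float32.

```python
import numpy as np

# Made-up toy vectors, one row per word.
vectors = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]], dtype=np.float32)

# Euclidean length of each row, kept as a column so that it broadcasts
# across the row during the division.
norms = np.sqrt((vectors ** 2).sum(-1))[..., np.newaxis]

# Each row is divided by its own length, giving unit-length rows.
vectors_norm = (vectors / norms).astype(np.float32)

# Sanity check: every row of the result should have length 1.
row_lengths = np.linalg.norm(vectors_norm, axis=1)
```

With unit-length rows, a dot product between a row and a unit-length query vector is their cosine similarity, which is what dot(limited, mean) relies on.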


So you can pick out the words you are interested in, compute their normalized vectors, and use the result in place of limited. This should work.
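A minimal sketch of that idea, written with plain NumPy rather than by patching gensim itself. The function most_similar_among and the dict word_to_vector are hypothetical names used only for illustration; with a trained model the vectors would come from the model instead of a hand-written dict.

```python
import numpy as np


def most_similar_among(query_word, candidate_words, word_to_vector, topn=3):
    """Rank candidate_words by cosine similarity to query_word.

    word_to_vector is a hypothetical dict {word: 1-D numpy array}.
    """
    # Never return the query word itself as its own neighbour.
    candidates = [w for w in candidate_words if w != query_word]
    matrix = np.array([word_to_vector[w] for w in candidates], dtype=np.float32)

    # Unit-normalise the selected rows; this plays the role of `limited`.
    limited = matrix / np.sqrt((matrix ** 2).sum(-1))[..., np.newaxis]

    # Unit-normalise the query; this plays the role of `mean`.
    query = np.asarray(word_to_vector[query_word], dtype=np.float32)
    mean = query / np.sqrt((query ** 2).sum())

    dists = limited.dot(mean)          # cosine similarities
    best = np.argsort(-dists)[:topn]   # indices of the highest scores first
    return [(candidates[i], float(dists[i])) for i in best]


# Made-up toy vectors for illustration only.
toy_vectors = {
    "human": np.array([1.0, 0.0]),
    "computer": np.array([0.9, 0.1]),
    "survey": np.array([0.0, 1.0]),
    "interface": np.array([0.5, 0.5]),
}
result = most_similar_among("human", ["computer", "survey"], toy_vectors, topn=2)
```

Only the words passed in candidate_words are scored, so "interface" can never appear in the result even though it has a vector.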
    
             
                                                        
            
            
              
                