How to remove a word completely from a Word2Vec model in gensim?

夕颜 2020-12-16 13:09

Given a model, e.g.

from gensim.models.word2vec import Word2Vec


documents = ["Human machine interface for lab abc computer applications",
             "A survey of u

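A minimal sketch of the kind of setup meant here (assuming gensim 3.x and the usual tutorial-style corpus; 'graph' stands in for whichever word you want to remove, as in the answer below):

from gensim.models.word2vec import Word2Vec

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# lowercase and tokenise; min_count=1 keeps every word in the vocabulary
sentences = [doc.lower().split() for doc in documents]
w2v_model = Word2Vec(sentences, min_count=1)

print('graph' in w2v_model.wv.vocab)  # True under the gensim 3.x API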

        
4 Answers
  •  猫巷女王i
     2020-12-16 13:46

    Note that this does not trim the model per se. It trims the KeyedVectors object that the similarity look-ups are based on.

    Suppose you only want to keep the top 5000 words in your model.

    import numpy as np

    wv = w2v_model.wv
    # keep only the 5000 most frequent words; index2word is sorted by frequency
    words_to_trim = wv.index2word[5000:]
    # in the OP's case:
    # words_to_trim = ['graph']
    ids_to_trim = [wv.vocab[w].index for w in words_to_trim]

    for w in words_to_trim:
        del wv.vocab[w]

    # drop the corresponding rows from the embedding matrix
    wv.vectors = np.delete(wv.vectors, ids_to_trim, axis=0)
    wv.init_sims(replace=True)

    # remove the trimmed words from the index list, highest index first
    for i in sorted(ids_to_trim, reverse=True):
        del wv.index2word[i]
    

    This does the job because the BaseKeyedVectors class contains the following attributes: self.vectors, self.vectors_norm, self.vocab, self.vector_size, self.index2word.
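
    As a quick sanity check (a sketch under the same gensim 3.x assumptions as above), those attributes should stay aligned after the trim; and if you removed words from the middle of the vocabulary (e.g. 'graph') rather than the tail, the surviving words' stored indices need re-syncing:

    # vocab, index2word and the vector matrix must agree on size
    assert len(wv.vocab) == len(wv.index2word) == wv.vectors.shape[0]

    # re-sync each surviving word's .index to its new row in wv.vectors
    for i, w in enumerate(wv.index2word):
        wv.vocab[w].index = i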

    The advantage of this is that if you save the KeyedVectors with a method such as save_word2vec_format(), the resulting file is much smaller.
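
    For example (the file name is just illustrative):

    wv.save_word2vec_format('w2v_trimmed.txt', binary=False)

    # the trimmed file loads back as a plain KeyedVectors object
    from gensim.models import KeyedVectors
    small_wv = KeyedVectors.load_word2vec_format('w2v_trimmed.txt', binary=False)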
