How to remove a word completely from a Word2Vec model in gensim?


Given a model, e.g.

from gensim.models.word2vec import Word2Vec


documents = [\"Human machine interface for lab abc computer applications\",
\"A survey of u         


        
4 Answers
  • 2020-12-16 13:32

    There is no direct way to do what you are looking for, but you are not completely lost. The most_similar method is implemented in the WordEmbeddingsKeyedVectors class (see the gensim source); you can take a look at it and modify it to suit your needs.

    The lines shown below perform the actual logic of computing the similar words; you need to replace the variable limited with the vectors corresponding to the words you are interested in, and then you are done:

    limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]
    dists = dot(limited, mean)
    if not topn:
        return dists
    best = matutils.argsort(dists, topn=topn + len(all_words), reverse=True)
    

    Update:

    limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]
    

    This line means that if restrict_vocab is passed, the candidate set is restricted to the top n words of the vocab, which is meaningful only if the vocab is sorted by frequency (gensim does this by default). If you do not pass restrict_vocab, all of self.vectors_norm goes into limited.
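
    For instance, since gensim sorts the vocabulary by descending frequency by default, you can use this parameter directly to limit a query to, say, the 10,000 most frequent words (w2v being a loaded KeyedVectors, as in the other answers):

    # Consider only the 10,000 most frequent words as candidate results
    w2v.most_similar("beer", restrict_vocab=10000)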

    The most_similar method calls another method, init_sims, which initializes the value of self.vectors_norm as shown below:

    self.vectors_norm = (self.vectors / sqrt((self.vectors ** 2).sum(-1))[..., newaxis]).astype(REAL)
    

    So you can pick up the words you are interested in, prepare their unit-normed vectors, and use them in place of limited. This should work.
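
    For example, a minimal sketch of that idea (assuming the pre-4.0 gensim KeyedVectors API; most_similar_among is my own illustrative helper, not part of gensim):

    import numpy as np

    def most_similar_among(kv, word, allowed_words, topn=10):
        # Compute similarities only against the unit-normed vectors of an
        # allowed subset, which plays the role of `limited` above.
        kv.init_sims()  # make sure kv.vectors_norm is populated
        allowed = [w for w in allowed_words if w in kv.vocab and w != word]
        idx = [kv.vocab[w].index for w in allowed]
        limited = kv.vectors_norm[idx]
        mean = kv.word_vec(word, use_norm=True)  # unit-length query vector
        dists = limited.dot(mean)
        best = np.argsort(-dists)[:topn]
        return [(allowed[i], float(dists[i])) for i in best]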

  • 2020-12-16 13:34

    I wrote a function that removes from the KeyedVectors all the words that aren't in a predefined word list.

    import numpy as np

    def restrict_w2v(w2v, restricted_word_set):
        # Note: w2v.vectors_norm must already be populated, e.g. by calling
        # w2v.init_sims() or any most_similar() query beforehand.
        new_vectors = []
        new_vocab = {}
        new_index2entity = []
        new_vectors_norm = []

        for i in range(len(w2v.vocab)):
            word = w2v.index2entity[i]
            vec = w2v.vectors[i]
            vocab = w2v.vocab[word]
            vec_norm = w2v.vectors_norm[i]
            if word in restricted_word_set:
                vocab.index = len(new_index2entity)
                new_index2entity.append(word)
                new_vocab[word] = vocab
                new_vectors.append(vec)
                new_vectors_norm.append(vec_norm)

        w2v.vocab = new_vocab
        w2v.vectors = np.array(new_vectors)  # store proper arrays, not lists
        w2v.index2entity = new_index2entity
        w2v.index2word = new_index2entity
        w2v.vectors_norm = np.array(new_vectors_norm)
    

    It rewrites all of the word-related attributes of the Word2VecKeyedVectors object.

    Usage:

    w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
    w2v.most_similar("beer")
    

    [('beers', 0.8409687876701355),
    ('lager', 0.7733745574951172),
    ('Beer', 0.71753990650177),
    ('drinks', 0.668931245803833),
    ('lagers', 0.6570086479187012),
    ('Yuengling_Lager', 0.655455470085144),
    ('microbrew', 0.6534324884414673),
    ('Brooklyn_Lager', 0.6501551866531372),
    ('suds', 0.6497018337249756),
    ('brewed_beer', 0.6490240097045898)]

    restricted_word_set = {"beer", "wine", "computer", "python", "bash", "lagers"}
    restrict_w2v(w2v, restricted_word_set)
    w2v.most_similar("beer")
    

    [('lagers', 0.6570085287094116),
    ('wine', 0.6217695474624634),
    ('bash', 0.20583480596542358),
    ('computer', 0.06677375733852386),
    ('python', 0.005948573350906372)]

  • 2020-12-16 13:46

    Note that this does not trim the model per se. It trims the KeyedVectors object that the similarity look-ups are based on.

    Suppose you only want to keep the top 5000 words in your model.

    import numpy as np

    wv = w2v_model.wv
    words_to_trim = wv.index2word[5000:]
    # In OP's case:
    # words_to_trim = ['graph']
    ids_to_trim = [wv.vocab[w].index for w in words_to_trim]

    for w in words_to_trim:
        del wv.vocab[w]

    wv.vectors = np.delete(wv.vectors, ids_to_trim, axis=0)
    wv.init_sims(replace=True)  # recompute the norms for the kept vectors

    for i in sorted(ids_to_trim, reverse=True):
        del wv.index2word[i]
    

    This does the job because the BaseKeyedVectors class contains the following attributes: self.vectors, self.vectors_norm, self.vocab, self.vector_size, self.index2word.

    The advantage of this is that if you write the KeyedVectors using methods such as save_word2vec_format(), the file is much smaller.
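
    For example (the file name here is just illustrative):

    # Persist the trimmed vectors; the file now holds only the kept 5000 words
    wv.save_word2vec_format('w2v_top5000.bin', binary=True)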

  • 2020-12-16 13:50

    I have tried this and feel that the most straightforward way is as follows:

    1. Get the Word2Vec embeddings in text file format.
    2. Identify the lines corresponding to the word vectors that you would like to keep.
    3. Write a new text file Word2Vec embedding model.
    4. Load model and enjoy (save to binary if you wish, etc.)...

    My sample code is as follows:

    import re

    # isLatin(), txtWrite() and txtAppend() are the author's own helpers
    # (Latin-character check, file overwrite and file append, respectively).

    line_no = 0  # line 0 = header
    numEntities = 0
    targetLines = []
    
    with open(file_entVecs_txt,'r') as fp:
        header = fp.readline() # header
    
        while True:
            line = fp.readline()
            if line == '': #EOF
                break
            line_no += 1
    
            isLatinFlag = True
            for i_l, char in enumerate(line):
                if not isLatin(char): # Care about entity that is Latin-only
                    isLatinFlag = False
                    break
                if char==' ': # reached separator
                    ent = line[:i_l]
                    break
    
            if not isLatinFlag:
                continue
    
            # Check for numbers in entity
            if re.search(r'\d', ent):
                continue
    
            # Check for entities with subheadings '#' (e.g. 'ENTITY/Stereotactic_surgery#History')
            if re.match(r'^ENTITY/.*#', ent):
                continue
    
            targetLines.append(line_no)
            numEntities += 1
    
    # Update header with new metadata
    header_new = re.sub(r'^\d+', str(numEntities), header, count=1)
    
    # Generate the file
    txtWrite('',file_entVecs_SHORT_txt)
    txtAppend(header_new,file_entVecs_SHORT_txt)
    
    line_no = 0
    ptr = 0
    with open(file_entVecs_txt,'r') as fp:
        while ptr < len(targetLines):
            target_line_no = targetLines[ptr]
    
            while (line_no != target_line_no):
                fp.readline()
                line_no+=1
    
            line = fp.readline()
            line_no+=1
            ptr+=1
            txtAppend(line,file_entVecs_SHORT_txt)
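
    The trimmed file can then be loaded like any other text-format model (step 4 above); a minimal sketch, with an illustrative name for the re-saved binary file:

    from gensim.models import KeyedVectors

    # Load the filtered text-format embeddings written above
    wv_short = KeyedVectors.load_word2vec_format(file_entVecs_SHORT_txt, binary=False)
    # Optionally re-save in binary for faster loading later
    wv_short.save_word2vec_format('entVecs_SHORT.bin', binary=True)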
    

    FYI: FAILED ATTEMPT. I tried out @zsozso's method (with the np.array modifications suggested by @Taegyung) and left it to run overnight for at least 12 hrs; it was still stuck at building the new vocab from the restricted set. This is perhaps because I have a lot of entities, but my text-file method works within an hour.

    FAILED CODE

    # [FAILED] Stuck at "Building new vocab..."
    import numpy as np

    def restrict_w2v(w2v, restricted_word_set):
        new_vectors = []
        new_vocab = {}
        new_index2entity = []
        new_vectors_norm = []
    
        print('Building new vocab..')
    
        for i in range(len(w2v.vocab)):
    
            if (i%int(1e6)==0) and (i!=0):
                print(f'working on {i}')
    
            word = w2v.index2entity[i]
            vec = np.array(w2v.vectors[i])
            vocab = w2v.vocab[word]
            vec_norm = w2v.vectors_norm[i]
            if word in restricted_word_set:
                vocab.index = len(new_index2entity)
                new_index2entity.append(word)
                new_vocab[word] = vocab
                new_vectors.append(vec)
                new_vectors_norm.append(vec_norm)
    
        print('Assigning new vocab')
        w2v.vocab = new_vocab
        print('Assigning new vectors')
        w2v.vectors = np.array(new_vectors)
        print('Assigning new index2entity, index2word')
        w2v.index2entity = new_index2entity
        w2v.index2word = new_index2entity
        print('Assigning new vectors_norm')
        w2v.vectors_norm = np.array(new_vectors_norm)
    