Add new words to GoogleNews with gensim

Question

I want to get word embeddings for the words in a corpus, and I decided to use the pretrained GoogleNews word vectors through the gensim library. However, my corpus contains some words that are not in the GoogleNews vocabulary. For each of these missing words, I want to use the arithmetic mean of the vectors of the n most similar words in GoogleNews. First I load GoogleNews and check whether the word "to" is in it:

from gensim.models import word2vec

# Load the GoogleNews pretrained word2vec model
model = word2vec.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print(model["to"])

I receive an error: KeyError: "word 'to' not in vocabulary". Is it possible that such a large dataset doesn't contain this word? The same is true for some other common words, like "a"!

To add the missing words to the word2vec model, I first want to get the indices of the words that are in GoogleNews. For missing words I have used index 0.

#obtain index of words
word_to_idx=OrderedDict({w:0 for w in corpus_words})
word_to_idx=OrderedDict({w:model.wv.vocab[w].index for w in corpus_words if w in model.wv.vocab})

Then I calculate the mean of the embedding vectors of the words most similar to each missing word.

missing_embd={}
for key,value in word_to_idx.items():
    if value==0:
        similar_words=model.wv.most_similar(key)
        similar_embeddings=[model.wv[a[0]] for a in similar_words]
        missing_embd[key]=mean(similar_embeddings)

And then I add these new embeddings to the word2vec model:

for word,embd in missing_embd.items():
    # model.wv.build_vocab(word,update=True)
    model.wv.syn0[model.wv.vocab[word].index]=embd

There is an inconsistency. When I print missing_embd, it's empty, as if there were no missing words. But when I check with this:

for w in tokens_lower:
    if(w in model.wv.vocab)==False:
        print(w)
        print("***********")

I find a lot of missing words. Now I have 3 questions:

1. Why is missing_embd empty when there are missing words?
2. Is it possible that GoogleNews doesn't have words like "to"?
3. How can I append new embeddings to the word2vec model? I tried build_vocab and syn0.

Thanks.

Answer 1:

Here is a scenario where we add a missing lowercase word.

from gensim.models import KeyedVectors
path = '../input/embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin'
embedding = KeyedVectors.load_word2vec_format(path, binary=True)

'Quoran' in embedding.vocab
 Output : True

'quoran' in embedding.vocab
 Output : False

Here Quoran is present, but the lowercase quoran is missing.

# add quoran in lower case
embedding.add('quoran',embedding.get_vector('Quoran'),replace=False)

'quoran' in embedding.vocab
 Output : True

Answer 2:

It's possible Google removed common filler words like 'to' and 'a'. If the file otherwise seems uncorrupted, and checking other words after loading shows that they are present, it would be reasonable to assume Google discarded the overly common words as having such diffuse meaning as to be of low value.

It's unclear and muddled what you're trying to do. You assign to word_to_idx twice - so only the second line matters.

(The first assignment, creating a dict where all words have a 0 value, has no lingering effect after the 2nd line creates an all-new dict, with only entries where w in model.wv.vocab. The only possible entry with a 0 after this step would be whatever word in the word-vectors set was already in position 0 – if and only if that word was also in your corpus_words.)
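The effect of the double assignment can be reproduced without loading any model. The following is a minimal sketch in which vocab_index and corpus_words are made-up stand-ins for model.wv.vocab and the asker's corpus; it shows why checking for value 0 finds no missing words, and one possible way to collect the missing words explicitly.

```python
from collections import OrderedDict

# Hypothetical stand-ins: word -> row index, and the words of a small corpus.
vocab_index = {"news": 0, "google": 1, "vector": 2}
corpus_words = ["news", "vector", "quoran", "to"]

# Same pattern as in the question: the second assignment replaces the first.
word_to_idx = OrderedDict({w: 0 for w in corpus_words})
word_to_idx = OrderedDict(
    {w: vocab_index[w] for w in corpus_words if w in vocab_index}
)

# Only in-vocabulary words survive. "news" has value 0 because it really
# sits at position 0, not because it is missing.
zero_valued = [w for w, idx in word_to_idx.items() if idx == 0]

# One way to track the missing words directly, without a sentinel index.
missing_words = [w for w in corpus_words if w not in vocab_index]

print(list(word_to_idx.items()))
print(zero_valued)
print(missing_words)
```

Keeping the missing words in their own list avoids overloading index 0, which is a valid position for a real word.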

You seem to want to build new vectors for unknown words based on an average of similar words. However, most_similar() only works for known words. It will raise an error if tried on a completely unknown word, so that approach can't work.
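To see why a similarity lookup needs the query word to be known, here is a toy sketch. It is not gensim's implementation; toy_vectors, cosine and toy_most_similar are hypothetical names invented for illustration. The lookup has to fetch the query word's own vector before it can compare it with anything, so an unknown word fails at the very first step.

```python
import math

# Made-up 2-dimensional vectors, for illustration only.
toy_vectors = {
    "king": [0.9, 0.1],
    "queen": [0.8, 0.2],
    "apple": [0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_most_similar(word, topn=2):
    """Rank the other words by cosine similarity to `word`."""
    query = toy_vectors[word]  # raises KeyError if `word` has no vector
    scored = [
        (other, cosine(query, vec))
        for other, vec in toy_vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

print(toy_most_similar("king"))

try:
    toy_most_similar("quoran")
except KeyError:
    print("no vector for 'quoran', so nothing to compare against")
```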

And a deeper problem is the gensim KeyedVectors class doesn't have support for dynamically adding new word->vector entries. You would have to dig into its source code and, to add one or a batch of new vectors, modify a bunch of its internal properties (including its vectors array, vocab dict, and index2entity list) in a self-consistent manner to have new entries.
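What "self-consistent" means here can be illustrated with a toy container. This is a sketch only, not gensim's class: ToyKeyedVectors is a hypothetical name, and it merely borrows the three attribute names mentioned above (vectors, vocab, index2entity) to show that all three must be updated together so that row i of vectors always belongs to index2entity[i].

```python
class ToyKeyedVectors:
    """Toy word -> vector store with three parallel structures."""

    def __init__(self):
        self.vectors = []       # row i is the vector of index2entity[i]
        self.vocab = {}         # word -> row index
        self.index2entity = []  # row index -> word

    def add(self, word, vector):
        """Append one entry, updating all three structures together."""
        if word in self.vocab:
            raise ValueError("word already present: %r" % word)
        if self.vectors and len(vector) != len(self.vectors[0]):
            raise ValueError("vector has the wrong dimensionality")
        self.vocab[word] = len(self.index2entity)
        self.index2entity.append(word)
        self.vectors.append(list(vector))

    def __contains__(self, word):
        return word in self.vocab

    def __getitem__(self, word):
        return self.vectors[self.vocab[word]]


kv = ToyKeyedVectors()
kv.add("Quoran", [0.5, 0.25])
# Reuse the capitalized word's vector for the lowercase form.
kv.add("quoran", kv["Quoran"])
print("quoran" in kv, kv.vocab["quoran"], kv.index2entity)
```

Writing a vector into the array without also registering the word in the dict and the list, as in the syn0 assignment from the question, leaves the word impossible to look up.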

Source: https://stackoverflow.com/questions/50618993/add-new-words-to-googlenews-by-gensim

Tags

python

word2vec

gensim

google-news