Add new words to GoogleNews with gensim

Submitted by 拜拜、爱过 on 2019-12-22 22:27:43

Question


I want to get word embeddings for the words in a corpus, and I decided to use the pretrained GoogleNews word vectors through the gensim library. However, my corpus contains some words that are not in the GoogleNews vocabulary. For each of these missing words, I want to use the arithmetic mean of the vectors of the n most similar words in GoogleNews. First I load GoogleNews and check whether the word "to" is in it:

# Load the pretrained GoogleNews word2vec vectors
from gensim.models import word2vec
model = word2vec.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print(model["to"])

I receive an error: KeyError: "word 'to' not in vocabulary". Is it possible that such a large dataset doesn't contain this word? The same is true for some other common words, such as "a".

To add the missing words to the word2vec model, I first want to get the indices of the words that are in GoogleNews. For missing words I use index 0.

#obtain index of words
word_to_idx=OrderedDict({w:0 for w in corpus_words})
word_to_idx=OrderedDict({w:model.wv.vocab[w].index for w in corpus_words if w in model.wv.vocab})

Then I calculate the mean of the embedding vectors of the words most similar to each missing word.

missing_embd={}
for key,value in word_to_idx.items():
    if value==0:
        similar_words=model.wv.most_similar(key)
        similar_embeddings=[model.wv[a[0]] for a in similar_words]
        missing_embd[key]=mean(similar_embeddings)

And then I add these new embeddings to the word2vec model with:

for word,embd in missing_embd.items():
    # model.wv.build_vocab(word,update=True)
    model.wv.syn0[model.wv.vocab[word].index]=embd

There is an inconsistency. When I print missing_embd, it is empty, as if there were no missing words. But when I check with this:

for w in tokens_lower:
    if(w in model.wv.vocab)==False:
        print(w)
        print("***********")

I found a lot of missing words. Now I have three questions: 1- Why is missing_embd empty when there are missing words? 2- Is it possible that GoogleNews doesn't have words like "to"? 3- How can I append new embeddings to the word2vec model? I tried build_vocab and syn0. Thanks.


Answer 1:


Here is a scenario where we add a missing lowercase word.

from gensim.models import KeyedVectors
path = '../input/embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin'
embedding = KeyedVectors.load_word2vec_format(path, binary=True)

'Quoran' in embedding.vocab
 Output : True

'quoran' in embedding.vocab
 Output : False

Here Quoran is present, but the lowercase quoran is missing.

# add quoran in lower case
embedding.add('quoran',embedding.get_vector('Quoran'),replace=False)

'quoran' in embedding.vocab
 Output : True
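
The same idea can be extended to try several casings before giving up on a word. The following is a minimal sketch, assuming a plain Python set as a stand-in for the embedding vocabulary; the helper name find_case_variant is hypothetical and not part of gensim.

```python
def find_case_variant(word, vocab):
    """Return the first casing of `word` found in `vocab`, or None.

    `vocab` is any container supporting `in` (a plain set here, standing
    in for the vocabulary of a loaded embedding).
    """
    candidates = [word, word.lower(), word.capitalize(), word.upper()]
    for candidate in candidates:
        if candidate in vocab:
            return candidate
    return None


# Toy vocabulary standing in for the real GoogleNews vocabulary.
toy_vocab = {"Quoran", "NASA", "apple"}

print(find_case_variant("quoran", toy_vocab))  # Quoran
print(find_case_variant("nasa", toy_vocab))    # NASA
print(find_case_variant("zzzz", toy_vocab))    # None
```

The vector of the variant that was found could then be copied to the missing spelling, as in the add() call above.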



Answer 2:


It's possible Google removed common filler words like 'to' and 'a'. If the file otherwise seems uncorrupted, and checking other words after loading shows that they are present, it'd be reasonable to assume Google discarded the overly common words as having such diffuse meaning as to be of low value.

It's unclear and muddled what you're trying to do. You assign to word_to_idx twice - so only the second line matters.

(The first assignment, creating a dict where all words have a 0 value, has no lingering effect after the 2nd line creates an all-new dict, with only entries where w in model.wv.vocab. The only possible entry with a 0 after this step would be whatever word in the word-vectors set was already in position 0 – if and only if that word was also in your corpus_words.)
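
The effect of the double assignment can be reproduced with toy data. This is a minimal sketch in which corpus_words and known_index are made-up stand-ins for the corpus vocabulary and for model.wv.vocab. It also shows one way to build the mapping in a single pass, using None rather than 0 to mark missing words so that the marker cannot collide with the real index 0.

```python
from collections import OrderedDict

# Made-up stand-ins: the real code would use the corpus words and model.wv.vocab.
corpus_words = ["the", "cat", "to", "quoran"]
known_index = {"the": 0, "cat": 1}  # word -> index, known words only

# As in the question: the second assignment replaces the first dict entirely.
word_to_idx = OrderedDict({w: 0 for w in corpus_words})
word_to_idx = OrderedDict({w: known_index[w] for w in corpus_words if w in known_index})
print(list(word_to_idx.items()))  # [('the', 0), ('cat', 1)]; missing words are absent

# Single pass: known words get their index, missing words get None.
word_to_idx_fixed = OrderedDict((w, known_index.get(w)) for w in corpus_words)
missing = [w for w, idx in word_to_idx_fixed.items() if idx is None]
print(missing)  # ['to', 'quoran']
```

In the toy data the known word "the" sits at index 0, so a later check for value == 0 would wrongly treat it as missing, which is the case described in the parenthetical above.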

You seem to want to build new vectors for unknown words based on an average of similar words. However, most_similar() only works for known words. It will raise an error if tried on a completely unknown word, so that approach can't work.
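
Since most_similar() cannot be queried with an unknown word, the stand-in words have to come from somewhere else, for example a hand-picked list of known synonyms or case variants. The averaging step itself is simple. A minimal sketch, assuming toy three-dimensional vectors in a plain dict instead of a loaded model; mean_vector is a hypothetical helper, not a gensim function.

```python
def mean_vector(words, vectors):
    """Element-wise arithmetic mean of the vectors of the known `words`.

    `vectors` maps word -> list of floats. Unknown words are skipped;
    a KeyError is raised if none of the words is known.
    """
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise KeyError("none of the candidate words has a vector")
    return [sum(column) / len(known) for column in zip(*known)]


# Toy vectors standing in for real embeddings.
toy_vectors = {
    "car": [1.0, 0.0, 2.0],
    "truck": [3.0, 2.0, 0.0],
}

# Candidates for an unknown word, chosen by the caller (an assumption here).
print(mean_vector(["car", "truck", "lorry"], toy_vectors))  # [2.0, 1.0, 1.0]
```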

And a deeper problem is the gensim KeyedVectors class doesn't have support for dynamically adding new word->vector entries. You would have to dig into its source code and, to add one or a batch of new vectors, modify a bunch of its internal properties (including its vectors array, vocab dict, and index2entity list) in a self-consistent manner to have new entries.
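
To illustrate what "self-consistent" means here, the following toy class keeps three structures in step when an entry is added: a list of vectors, a word-to-index dict, and an index-to-word list. It is a simplified stand-in written for this example, not gensim's actual implementation, and all names in it are hypothetical. Note that the first answer above uses an add() method, which suggests that some gensim versions do offer a supported way to add entries; check the documentation of the installed version before modifying internals.

```python
class ToyKeyedVectors:
    """Toy word -> vector store with three structures kept in sync."""

    def __init__(self, vector_size):
        self.vector_size = vector_size
        self.vectors = []        # row i holds the vector of index_to_word[i]
        self.word_to_index = {}  # word -> row index
        self.index_to_word = []  # row index -> word

    def add(self, word, vector, replace=False):
        if len(vector) != self.vector_size:
            raise ValueError("vector has the wrong dimensionality")
        if word in self.word_to_index:
            if replace:
                self.vectors[self.word_to_index[word]] = list(vector)
            return
        # Update all three structures together so the indices stay aligned.
        self.word_to_index[word] = len(self.index_to_word)
        self.index_to_word.append(word)
        self.vectors.append(list(vector))

    def __contains__(self, word):
        return word in self.word_to_index

    def __getitem__(self, word):
        return self.vectors[self.word_to_index[word]]


store = ToyKeyedVectors(vector_size=2)
store.add("Quoran", [0.5, -1.0])
store.add("quoran", store["Quoran"])  # reuse the vector of the cased form
print("quoran" in store, store["quoran"])  # True [0.5, -1.0]
```

If only one of the three structures were updated, lookups by word and lookups by index would disagree, which is the failure mode the paragraph above warns about.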



Source: https://stackoverflow.com/questions/50618993/add-new-words-to-googlenews-by-gensim
