I have a word2vec model in gensim trained over 98892 documents. For any given sentence that is not present in the sentences array (i.e. the set over which I trained the model), I need to update the model with that sentence so that querying it the next time returns some results. I am doing it like this:
new_sentence = ['moscow', 'weather', 'cold']
model.train(new_sentence)
and it prints this to the logs:
2014-03-01 16:46:58,061 : INFO : training model with 1 workers on 98892 vocabulary and 100 features
2014-03-01 16:46:58,211 : INFO : reached the end of input; waiting to finish 1 outstanding jobs
2014-03-01 16:46:58,235 : INFO : training on 10 words took 0.1s, 174 words/s
Now, when I query with a similar new_sentence for most positives (as model.most_similar(positive=new_sentence)), it raises an error:
Traceback (most recent call last):
File "<pyshell#220>", line 1, in <module>
model.most_similar(positive=['moscow', 'weather', 'cold'])
File "/Library/Python/2.7/site-packages/gensim/models/word2vec.py", line 405, in most_similar
raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word 'cold' not in vocabulary"
Which indicates that the word 'cold' is not part of the vocabulary over which I trained the model (am I right)?
So the question is: How to update the model so that it gives out all the possible similarities for the given new sentence?
train() expects a sequence of sentences on input, not one sentence.
train() only updates weights for existing feature vectors based on existing vocabulary. You cannot add new vocabulary (= new feature vectors) using train().
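To illustrate the first point, here is a minimal sketch in plain Python (no gensim needed). It assumes, as is the usual behaviour, that the trainer iterates over the corpus and treats each item as one sentence, i.e. a list of tokens. The helper name tokens_seen_by_trainer is hypothetical and only mimics that iteration.

```python
new_sentence = ['moscow', 'weather', 'cold']


def tokens_seen_by_trainer(corpus):
    """Hypothetical helper: iterate a corpus the way a sentence-based
    trainer would, treating each item as one sentence of tokens."""
    return [list(sentence) for sentence in corpus]


# Passing the bare sentence: every word is taken as a "sentence",
# so its characters end up being the "tokens".
wrong = tokens_seen_by_trainer(new_sentence)

# Wrapping the sentence in a list gives one sentence of three tokens.
right = tokens_seen_by_trainer([new_sentence])

print(wrong)
print(right)
```

So the call in the question should at least be model.train([new_sentence]), although that alone still does not add 'cold' to the vocabulary.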
As of gensim 0.13.3 it's possible to do online training of Word2Vec with gensim.
model.build_vocab(new_sentences, update=True)
model.train(new_sentences)
If your model was generated by the C tool and loaded with load_word2vec_format, it is not possible to update that model. See the Online Training section of the Word2Vec tutorial:
Note that it’s not possible to resume training with models generated by the C tool, load_word2vec_format(). You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there.
First of all, you cannot add new words to a pre-trained model.
However, there's a "new" doc2vec model published in 2014 which meets all your requirements. You can use it to train a document vector instead of getting a set of word vectors and then combining them. The best part is that doc2vec can infer vectors for unseen sentences after training. Although the model is still unchangeable, you can get a pretty good inference result, based on my experiments.
The problem is that you cannot retrain a word2vec model with new sentences. Only doc2vec allows it. Try a doc2vec model.
Source: https://stackoverflow.com/questions/22121028/update-gensim-word2vec-model