I have a word2vec model in gensim trained over 98892 documents. For any given sentence that is not present in the sentences array (i.e. the set over which I trained the mode
As of gensim 0.13.3 it's possible to do online training of Word2Vec with gensim.
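model.build_vocab(new_sentences, update=True)
# Newer gensim releases require total_examples and epochs to be passed explicitly.
model.train(new_sentences, total_examples=model.corpus_count, epochs=model.epochs)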
First of all, you cannot add new words to a pre-trained model.
However, there is a "new" doc2vec model, published in 2014, which meets all your requirements. You can use it to train a document vector directly instead of getting a set of word vectors and then combining them. The best part is that doc2vec can infer vectors for unseen sentences after training. Although the model itself is still unchangeable, the inference results were pretty good in my experiments.
The problem is that you cannot retrain a word2vec model with new sentences. Only doc2vec allows it, so try a doc2vec model.
You can add to the model vocabulary, and add to the embedding using FastText.
from gensim.models import FastText
Here you can see some FastText examples. Here you can see how to use FastText to score Out-of-vocabulary (OOV) instances.
If your model was generated using the C tool and loaded with load_word2vec_format, it is not possible to update that model. See the Online Training section of the Word2Vec tutorial:
Note that it’s not possible to resume training with models generated by the C tool, load_word2vec_format(). You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there.
train() expects a sequence of sentences as input, not a single sentence.
train() only updates weights for existing feature vectors based on the existing vocabulary. You cannot add new vocabulary (= new feature vectors) using train().