Update gensim word2vec model


I have a word2vec model in gensim trained over 98892 documents. For any given sentence that is not present in the sentences array (i.e. the set over which I trained the mode


6 Answers

梦毁少年i

2020-12-02 23:30
As of gensim 0.13.3 it's possible to do online training of Word2Vec with gensim.
```
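# Add the words from the new sentences to the existing vocabulary
model.build_vocab(new_sentences, update=True)
# Recent gensim versions require total_examples and epochs to be passed explicitly
model.train(new_sentences, total_examples=model.corpus_count, epochs=model.epochs)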
```
無奈伤痛

2020-12-02 23:33

First of all, you cannot add new words to a pre-trained model.

However, there is a "new" doc2vec model, published in 2014, that meets all your requirements. You can use it to train a document vector directly instead of getting a set of word vectors and then combining them. The best part is that doc2vec can infer vectors for unseen sentences after training. Although the model itself is still unchangeable, you can get pretty good inference results, based on my experiments.

無奈伤痛

2020-12-02 23:34

The problem is that you cannot retrain a word2vec model with new sentences. Only doc2vec allows it. Try a doc2vec model.

闹比i

2020-12-02 23:34
You can add to the model vocabulary, and add to the embedding using FastText.
```
from gensim.models import FastText
```
Here you can see some FastText examples. Here you can see how to use FastText to score Out-of-vocabulary (OOV) instances.
时光取名叫无心

2020-12-02 23:40

If your model was generated by the C tool and loaded with load_word2vec_format(), it is not possible to update that model. See the Online Training section of the Word2Vec tutorial:

Note that it’s not possible to resume training with models generated by the C tool, load_word2vec_format(). You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there.

难免孤独

2020-12-02 23:45
1. train() expects a sequence of sentences as input, not a single sentence.
2. train() only updates the weights of existing feature vectors, based on the existing vocabulary. You cannot add new vocabulary (i.e. new feature vectors) using train().