Is it possible to re-train a word2vec model (e.g. GoogleNews-vectors-negative300.bin) from a corpus of sentences in python?


This is how I technically solved the issue:

Preparing the data input with a sentence iterable, following Radim Rehurek's tutorial: https://rare-technologies.com/word2vec-tutorial/

sentences = MySentences('newcorpus')  
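For reference, the sentence iterable from that tutorial looks roughly like this (a minimal sketch; 'newcorpus' is assumed to be a directory of plain-text files with one sentence per line):

import os

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # stream one tokenized sentence at a time, never loading the whole corpus into memory
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()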

Setting up the model

model = gensim.models.Word2Vec(sentences)
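Note that for the intersection step below, the new model's vector dimensionality has to match the pretrained vectors (300 for GoogleNews-vectors-negative300). A sketch with the dimensionality made explicit; the keyword is size in gensim 3.x (vector_size in gensim 4.x), and min_count/workers are just illustrative choices:

import gensim

# dimensionality must match the pretrained Google vectors (300)
model = gensim.models.Word2Vec(sentences, size=300, min_count=1, workers=4)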

Intersecting the vocabulary with the Google word vectors (lockf=1.0 leaves the imported vectors unlocked, so they continue to be updated during training)

model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin',
                                lockf=1.0,
                                binary=True)

Finally, training the model on the new corpus to update the vectors

model.train(sentences)
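In more recent gensim releases, train() additionally requires the corpus size and number of epochs to be passed explicitly; a hedged equivalent, assuming gensim >= 3.x where the model exposes corpus_count and epochs:

model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)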

A note of warning: from a substantive point of view, it is of course highly debatable whether a corpus that is likely to be very small can actually "improve" the Google word vectors, which were trained on a massive corpus...

It is possible only if the model's builder did not finalize the training. In Python (gensim), the finalizing call is:

model.init_sims(replace=True)  # finalize the model; the vectors can no longer be updated

If the model has not been finalized, this is a good way to continue training it on a large dataset.
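In other words, continued training is only possible on a full Word2Vec model saved before that call, not on exported vectors such as the .bin file. A hypothetical sketch ('my_model' and 'another_corpus' are placeholder names, and MySentences is the iterator shown above):

import gensim

# 'my_model' must have been saved with model.save(), which keeps the full training state
model = gensim.models.Word2Vec.load('my_model')
more_sentences = MySentences('another_corpus')

model.build_vocab(more_sentences, update=True)   # add any new words to the vocabulary
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.epochs)

# after init_sims(replace=True) the original vectors are overwritten by their
# normalized versions, so the model can no longer be meaningfully trained
model.init_sims(replace=True)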

Some folks have been working on extending gensim to allow online training.

A couple of GitHub pull requests you might want to watch for progress on that effort:

It looks like this improvement could allow updating the GoogleNews-vectors-negative300.bin model.
