gensim word2vec - updating word embeddings with newcoming data


Question


I have trained word embeddings on 26 million tweets with the skip-gram technique, as follows:

import gensim

# Stream the tweets line by line and train skip-gram (sg=1) embeddings of size 200 over 20 epochs.
sentences = gensim.models.word2vec.LineSentence('/.../data/tweets_26M.txt')
model = gensim.models.word2vec.Word2Vec(sentences, window=2, sg=1, size=200, iter=20)
model.save_word2vec_format('/.../savedModel/Tweets26M_All.model.bin', binary=True)

However, I am continuously collecting more tweets in my database. For example, when I have 2 million more tweets, I want to update my embeddings so that they also take these new 2M tweets into account.

Is it possible to load the previously trained model and update the embedding weights (perhaps also adding embeddings for newly seen words)? Or do I need to retrain on all 28 (26+2) million tweets from the beginning? Training takes 5 hours with the current parameters and will take longer as the data grows.
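For illustration, roughly what I have in mind is the sketch below. It assumes the full model was saved with model.save() rather than save_word2vec_format (the binary word2vec format keeps only the vectors, not the training state, so it cannot be used to continue training), and it uses gensim's build_vocab(update=True) followed by train(). The new-tweets path is just a placeholder, and older gensim versions name the epochs attribute iter instead of epochs.

import gensim

# Load the full model previously written with model.save() (placeholder path).
model = gensim.models.Word2Vec.load('/.../savedModel/Tweets26M_All.model')

# Stream the 2M newly collected tweets (placeholder path).
new_sentences = gensim.models.word2vec.LineSentence('/.../data/tweets_2M_new.txt')

# Add any previously unseen words to the existing vocabulary...
model.build_vocab(new_sentences, update=True)

# ...and continue training on the new tweets only.
model.train(new_sentences,
            total_examples=model.corpus_count,
            epochs=model.epochs)

model.save('/.../savedModel/Tweets28M_All.model')

If something like this is valid, it would avoid retraining on the full 28M tweets, at the cost of the new tweets influencing the vectors only incrementally.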

One other question: is it possible to feed the sentences parameter directly from the database (instead of reading it from txt, bz2, or gz files)? As the training data grows, it would be better to bypass the text read/write operations.
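For the database case, my understanding is that Word2Vec accepts any restartable iterable of token lists, so something like the following sketch might work; the connection object, table, and column names are placeholders, and the tokenizer is just str.split() for brevity.

import gensim

class DBSentences:
    """Restartable iterable that streams tokenized tweets from a database.

    Word2Vec iterates over the corpus several times (one vocabulary scan
    plus one pass per epoch), so __iter__ starts a fresh query each time.
    """
    def __init__(self, conn):
        self.conn = conn  # e.g. a sqlite3 / psycopg2 connection (assumed)

    def __iter__(self):
        cursor = self.conn.cursor()
        cursor.execute("SELECT text FROM tweets")  # placeholder table/column
        for (text,) in cursor:
            yield text.split()  # replace with a proper tweet tokenizer

# Usage sketch, mirroring the parameters above:
# model = gensim.models.word2vec.Word2Vec(DBSentences(conn),
#                                         window=2, sg=1, size=200, iter=20)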

Source: https://stackoverflow.com/questions/40727093/gensim-word2vec-updating-word-embeddings-with-newcoming-data
