What are doc2vec training iterations?

Submitted by 為{幸葍}努か on 2019-12-06 02:14:49
gojomo

Word2Vec and related algorithms (like 'Paragraph Vectors' aka Doc2Vec) usually make multiple training passes over the text corpus.

Gensim's Word2Vec/Doc2Vec lets you specify the number of passes with the iter parameter (renamed epochs in gensim 4.0 and later), provided you also supply the corpus at object initialization to trigger immediate training. (Your code above does this by passing docs to the Doc2Vec(docs, ...) constructor call.)

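If unspecified, the default iter value used by gensim is 5, matching the default in Google's original word2vec.c release, so your code above is already making 5 training passes. (Defaults can differ between gensim releases; the epochs attribute on the model shows the value actually in use.)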

Published Doc2Vec work often uses 10-20 passes. If you wanted to do 20 passes instead, you could change your Doc2Vec initialization to:

model = doc2vec.Doc2Vec(docs, iter=20, ...)

Because Doc2Vec often uses a unique identifier tag for each document, more iterations can matter more: each doc-vector then comes up for training multiple times over the course of training, as the model gradually improves. By contrast, the words in a Word2Vec corpus can appear anywhere throughout the corpus, so each word's vector gets multiple adjustments, early, middle, and late in the process as the model improves, even with just a single pass. (So with a giant, varied Word2Vec corpus, it's conceivable to use fewer than the default number of passes.)
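
To make that asymmetry concrete, here is a small pure-Python sketch. It does no real training; it only counts how many documents bring each tag or word up for training, which is a simplification (actual training makes several gradient steps per document). The corpus and the count_visits helper are hypothetical:

```python
from collections import Counter

def count_visits(corpus, passes):
    """Count how many documents bring each tag / word up for training."""
    tag_visits = Counter()
    word_visits = Counter()
    for _ in range(passes):
        for tag, words in corpus:
            tag_visits[tag] += 1            # a unique tag is reached once per pass
            word_visits.update(set(words))  # a word is reached in every doc containing it
    return tag_visits, word_visits

# Hypothetical corpus of (unique tag, tokens) pairs.
corpus = [
    ("doc_0", ["the", "cat", "sat"]),
    ("doc_1", ["the", "dog", "sat"]),
    ("doc_2", ["the", "cat", "ran"]),
]

one_pass_tags, one_pass_words = count_visits(corpus, passes=1)
many_pass_tags, _ = count_visits(corpus, passes=20)

print(one_pass_tags["doc_0"])   # tag reached at a single point in a single pass
print(one_pass_words["the"])    # word reached at several points in the same pass
print(many_pass_tags["doc_0"])  # tag reached once per pass across 20 passes
```

In a single pass the tag is reached at one point in training, while a common word is reached at several points spread through the corpus. Only additional passes give the tag the same early, middle, and late exposure.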

You don't need to write your own loop, and most users shouldn't. If you do manage the separate build_vocab() and train() steps yourself, instead of taking the easier route of supplying the docs corpus in the initializer call to trigger immediate training, then you must pass an epochs argument to train() (along with total_examples, the corpus size). train() then performs that number of passes itself, so you still only need one call to train().
