Updating training documents for gensim Doc2Vec model


Gensim Doc2Vec doesn't yet have official support for expanding the vocabulary (via build_vocab(..., update=True)), so the model's behavior here is undefined and unlikely to do anything useful. In fact, I think any existing doc-tags will be completely discarded and replaced with those in the latest corpus. (Additionally, there are outstanding unresolved reports of memory-fault process-crashes when trying to use build_vocab(..., update=True) with Doc2Vec, such as this issue.)

Even if that worked, there are a number of murky balancing issues to consider when continuing to call train() on a model with texts different from the initial training set. Each such training session will nudge the model to be better on the new examples, but lose some of the value of the original training, possibly making the model worse for some cases or overall.

The most defensible policy with a growing corpus would be to occasionally retrain from scratch with all training examples combined into one corpus. Another outline of a possible process for rolling updates to a model was discussed in my recent post to the gensim discussion list.
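As a rough illustration of that retrain-from-scratch policy, here is a minimal sketch. It assumes old_docs and new_docs are existing lists of (token-list, tag) pairs, and uses the older parameter names (size, iter) matching this answer; newer gensim releases rename these to vector_size and epochs:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Combine the original and newly-arrived documents into one corpus.
    # `old_docs` and `new_docs` are assumed lists of (token_list, tag) pairs.
    all_docs = [TaggedDocument(words=tokens, tags=[tag])
                for tokens, tag in old_docs + new_docs]

    # Retrain from scratch so every document contributes equally to the
    # vocabulary, the doc-tags, and the shared internal weights.
    model = Doc2Vec(size=100, negative=5, iter=20, min_count=2)
    model.build_vocab(all_docs)
    model.train(all_docs, total_examples=model.corpus_count, epochs=model.iter)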

A few other comments on your setup:

  • using both hierarchical softmax (hs=1) and negative sampling (with negative > 0) increases model size and training time, but may offer no advantage over using just one mode with more iterations (or other tweaks), so it's rare to have both modes active (see the sketch after this list)

  • by not specifying iter, you're using the default of 5 inherited from Word2Vec, while published Doc2Vec work often uses 10-20 or more training iterations

  • many report infer_vector working better with a much higher value for its optional steps parameter (which defaults to only 5), and/or with a smaller starting alpha (which defaults to 0.1)
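Putting those three points together, a minimal sketch follows. The parameter values are illustrative, not tuned recommendations; documents and tokenized_doc are assumed inputs, and newer gensim releases rename size/iter/steps to vector_size/epochs/epochs:

    from gensim.models.doc2vec import Doc2Vec

    # Negative sampling only (hs=0), rather than enabling both modes;
    # iter=20 overrides the Word2Vec-inherited default of 5 passes.
    model = Doc2Vec(documents, size=100, hs=0, negative=5, iter=20, min_count=2)

    # Inference often benefits from many more passes than the default
    # steps=5, and a smaller starting alpha than the default 0.1.
    vector = model.infer_vector(tokenized_doc, steps=50, alpha=0.025)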
