gensim doc2vec “intersect_word2vec_format” command


Yes, the intersect_word2vec_format() will let you bring vectors from an external file into a model that's already had its own vocabulary initialized (as if by build_vocab()). That is, it will only load those vectors for which there are already words in the local vocabulary.
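
As a rough sketch of the usual call sequence (note: the exact location of intersect_word2vec_format() has moved between gensim versions – directly on the model in older releases, on model.wv / KeyedVectors in newer ones – and the file name and toy corpus here are placeholders):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # toy placeholder corpus: each document wrapped as a TaggedDocument
    corpus = [TaggedDocument(words=text.split(), tags=[i])
              for i, text in enumerate(["the quick brown fox",
                                        "jumped over the lazy dog"])]

    model = Doc2Vec(vector_size=100, dm=1, min_count=1, epochs=20)
    model.build_vocab(corpus)  # local vocabulary must exist before intersecting

    # only vectors for words already in the vocabulary are loaded from the file;
    # the external file's dimensionality must match vector_size
    model.intersect_word2vec_format("external_vectors.bin", binary=True)

    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)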

Additionally, it will by default lock those loaded vectors against any further adjustment during subsequent training, though other words in the pre-existing vocabulary may continue to update. (You can change this behavior by supplying a lockf=1.0 value instead of the default 0.0.)
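
For example, under the same assumptions as the sketch above, the unlocked variant would be:

    # lockf=1.0 lets the imported vectors keep adjusting during later training;
    # the default lockf=0.0 freezes them at their imported values
    model.intersect_word2vec_format("external_vectors.bin", binary=True, lockf=1.0)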

However, this is best considered an experimental function, and what benefits, if any, it might offer will depend on many factors specific to your setup.

The PV-DBOW Doc2Vec mode, corresponding to the dm=0 parameter, is often a top-performer in speed and doc-vector quality, and doesn't use or train word-vectors at all – so any pre-loading of vectors won't have any effect.
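
As a quick illustration (same hedged assumptions and placeholder corpus as the sketch above), a pure PV-DBOW model is configured like this, and any pre-loaded word-vectors would simply sit unused:

    # dm=0 selects PV-DBOW; with the default dbow_words=0 no word-vectors are
    # trained, so vectors merged in via intersect_word2vec_format() go unused
    dbow_model = Doc2Vec(vector_size=100, dm=0, min_count=1, epochs=20)
    dbow_model.build_vocab(corpus)
    dbow_model.train(corpus, total_examples=dbow_model.corpus_count,
                     epochs=dbow_model.epochs)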

The PV-DM mode, enabled by the default dm=1 setting, trains any word-vectors it needs simultaneously with doc-vector training. (That is, there's no separate phase where word-vectors are created first, so for the same number of iter passes, PV-DM training takes the same amount of time whether word-vectors start with default random values or are pre-loaded from elsewhere.) Pre-seeding the model with word-vectors from elsewhere might help or hurt final quality; it's likely to depend on the specifics of your corpus, meta-parameters, and goals, and on whether those external vectors represent word-meanings in sync with the current corpus and goal.
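
One practical way to decide, under the same assumptions and placeholder names as the sketches above, is to train two otherwise-identical PV-DM models, with and without the imported vectors, and compare them on whatever downstream check matters to you:

    def build(pretrained):
        m = Doc2Vec(vector_size=100, dm=1, min_count=1, epochs=20)
        m.build_vocab(corpus)
        if pretrained:
            m.intersect_word2vec_format("external_vectors.bin", binary=True)
        m.train(corpus, total_examples=m.corpus_count, epochs=m.epochs)
        return m

    baseline = build(False)
    seeded = build(True)
    # compare the two models on your own downstream task or spot-checks,
    # e.g. nearest-neighbour sanity checks on model.docvecs (model.dv in gensim 4.x)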
