How to handle words that are not in word2vec's vocab optimally


You could use FastText instead of Word2Vec. FastText can embed out-of-vocabulary words by using subword information (character n-grams). Gensim also has a FastText implementation, which is very easy to use:

from gensim.models import FastText

# training_data is a corpus of tokenized sentences (a list of lists of strings)
# note: in gensim 4.x the `size` parameter was renamed to `vector_size`
model = FastText(sentences=training_data, size=128, ...)

word = 'hello'  # can be out of vocabulary
embedding = model.wv[word]  # fetches the word embedding (synthesized from character n-grams if OOV)

Usually Doc2Vec text-vector usefulness is quite similar to (or, when tuned, a little better than) a plain average of word-vectors. (After all, the algorithms are very similar, work on the same form of the same data, and the resulting models are about the same size.) If there was a big drop-off, there may have been errors in your Doc2Vec process.
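For a baseline comparison, a plain average-of-word-vectors text embedding takes only a few lines. A minimal sketch, assuming a trained gensim Word2Vec model in w2v_model (the function and variable names here are illustrative, not from the original code):

import numpy as np

def average_vector(model, tokens):
    # keep only words the model knows; out-of-vocabulary words are simply skipped
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        # fall back to a zero vector for texts whose words are all unknown
        return np.zeros(model.wv.vector_size)
    return np.mean(vectors, axis=0)

text_vector = average_vector(w2v_model, ['hello', 'world'])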

As @AnnaKrogager notes, FastText can handle out-of-vocabulary words by synthesizing guesswork vectors from word fragments. (This requires languages where related words share such fragments or roots.) The resulting vectors may not be great, but they are often better than either ignoring unknown words entirely or using all-zero or random plug vectors.
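For example, a quick check of that behavior, assuming the FastText model trained above (the misspelled 'helo' is just an illustrative out-of-vocabulary token):

print('helo' in model.wv.vocab)              # False in gensim 3.x: not a trained word (gensim 4.x: model.wv.key_to_index)
print(model.wv.similarity('helo', 'hello'))  # still works: a vector is synthesized from shared character n-grams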

Is splitting it among processes helping the runtime at all? There is a lot of overhead in sending batches of work to and from subprocesses, and subprocesses in Python can cause a ballooning of memory needs; both that overhead and possibly even virtual-memory swapping could outweigh any other benefits of parallelism.
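One quick way to check is to time a plain single-process loop against the multi-process version on the same data. A rough sketch, where process_one and items stand in for whatever per-item work and corpus you are actually splitting up:

import time
from multiprocessing import Pool

def process_one(tokens):
    # placeholder for the real per-item work (e.g., inferring a vector for one text)
    return sum(hash(t) for t in tokens)

if __name__ == '__main__':
    items = [['some', 'example', 'tokens']] * 100000

    start = time.time()
    serial = [process_one(tokens) for tokens in items]   # plain single-process loop
    print('serial:  ', time.time() - start)

    start = time.time()
    with Pool(processes=4) as pool:                      # batches are pickled to/from subprocesses
        parallel = pool.map(process_one, items)
    print('parallel:', time.time() - start)

If the parallel timing isn't clearly better, the pickling and process-startup overhead is likely eating whatever speedup the extra processes provide.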
