How to handle words that are not in word2vec's vocab optimally


You could use FastText instead of Word2Vec. FastText can embed out-of-vocabulary words by using subword information (character n-grams). Gensim also has a FastText implementation, which is very easy to use:

from gensim.models import FastText

# training_data is a corpus of tokenized sentences (a list of lists of strings)
# note: in gensim 4.x the `size` parameter was renamed to `vector_size`
model = FastText(sentences=training_data, size=128, ...)

word = 'hello'  # can be out of vocabulary
embedding = model.wv[word]  # fetches the word embedding (synthesized from character n-grams if OOV)

Usually Doc2Vec text-vector usefulness is quite similar to (or, when tuned, a little better than) a plain average of word-vectors. (After all, the algorithms are very similar, work on the same form of the same data, and the resulting models are about the same size.) If there was a big drop-off, there may have been errors in your Doc2Vec process.
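For a baseline comparison, a plain average-of-word-vectors text embedding takes only a few lines. A minimal sketch, assuming a trained gensim Word2Vec model in w2v_model (the function and variable names here are illustrative, not from the original code):

import numpy as np

def average_vector(model, tokens):
    # keep only words the model knows; out-of-vocabulary words are simply skipped
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        # fall back to a zero vector for texts whose words are all unknown
        return np.zeros(model.wv.vector_size)
    return np.mean(vectors, axis=0)

text_vector = average_vector(w2v_model, ['hello', 'world'])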

As @AnnaKrogager notes, FastText can handle out-of-vocabulary words by synthesizing guesswork vectors from word fragments. (This requires languages where related words share such fragments or roots.) The resulting vectors may not be great, but they are often better than either ignoring unknown words entirely or using all-zero or random plug vectors.
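For example, a quick check of that behavior, assuming the FastText model trained above (the misspelled 'helo' is just an illustrative out-of-vocabulary token):

print('helo' in model.wv.vocab)              # False in gensim 3.x: not a trained word (gensim 4.x: model.wv.key_to_index)
print(model.wv.similarity('helo', 'hello'))  # still works: a vector is synthesized from shared character n-grams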

Is splitting it among processes helping the runtime at all? There is a lot of overhead in sending batches of work to and from subprocesses, and subprocesses in Python can cause a ballooning of memory needs; both that overhead and possibly even virtual-memory swapping could outweigh any other benefits of parallelism.
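One quick way to check is to time a plain single-process loop against the multi-process version on the same data. A rough sketch, where process_one and items stand in for whatever per-item work and corpus you are actually splitting up:

import time
from multiprocessing import Pool

def process_one(tokens):
    # placeholder for the real per-item work (e.g., inferring a vector for one text)
    return sum(hash(t) for t in tokens)

if __name__ == '__main__':
    items = [['some', 'example', 'tokens']] * 100000

    start = time.time()
    serial = [process_one(tokens) for tokens in items]   # plain single-process loop
    print('serial:  ', time.time() - start)

    start = time.time()
    with Pool(processes=4) as pool:                      # batches are pickled to/from subprocesses
        parallel = pool.map(process_one, items)
    print('parallel:', time.time() - start)

If the parallel timing isn't clearly better, the pickling and process-startup overhead is likely eating whatever speedup the extra processes provide.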
