After training word embedding with gensim's fasttext's wrapper, how to embed new sentences?

Submitted by 浪子不回头ぞ on 2021-01-07 03:56:25

Question


After reading the tutorial in gensim's docs, I do not understand the correct way to generate embeddings for new sentences from a trained model. So far I have trained gensim's FastText embeddings like this:

from gensim.models.fasttext import FastText as FT_gensim

model_gensim = FT_gensim(size=100)

# build the vocabulary
model_gensim.build_vocab(corpus_file=corpus_file)

# train the model
model_gensim.train(
    corpus_file=corpus_file, epochs=model_gensim.epochs,
    total_examples=model_gensim.corpus_count, total_words=model_gensim.corpus_total_words
)

Then, let's say I want to get the embedding vectors associated with these sentences:

sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()

How can I get them with model_gensim that I trained previously?


Answer 1:


You can look up each word's vector in turn:

wordvecs_obama = [model_gensim.wv[word] for word in sentence_obama]

For your 7-word input sentence, you'll then have a list of 7 word-vectors in wordvecs_obama.
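To sanity-check the shapes you should expect, here is a minimal sketch using random stand-in vectors (it assumes no trained model is at hand; real code would use the `model_gensim.wv[word]` lookups above):

```python
import numpy as np

# Stand-ins for the real lookups: 7 words, each a 100-dimensional
# vector, matching size=100 used when the model was created.
wordvecs_obama = [np.random.rand(100) for _ in range(7)]

# Stacking gives one (num_words, vector_size) matrix for the sentence.
sentence_matrix = np.vstack(wordvecs_obama)
print(sentence_matrix.shape)  # (7, 100)
```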

FastText models do not, by themselves, convert longer texts into single vectors. (And specifically, the model you've trained has no default way of doing that.)

The original Facebook FastText code does have a "classification mode" that uses a different style of training: texts are associated with known labels, and all the word-vectors of a text are combined, both during training and when the model is later asked to classify new texts. But the gensim implementation of FastText does not currently support this mode, as gensim's goal has been to supply unsupervised rather than supervised algorithms.

You could approximate what that FastText mode does by averaging together those word-vectors:

import numpy as np
meanvec_obama = np.array(wordvecs_obama).mean(axis=0)

Depending on your ultimate purposes, something like that might still be useful. (But that average won't be as useful for classification as if the word-vectors had originally been trained for that goal, with known labels, in FastText's classification mode.)
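For example, a common follow-up once you have averaged sentence vectors is to compare them with cosine similarity. A minimal numpy sketch, using random stand-in vectors in place of real `model_gensim.wv[word]` lookups (no trained model is assumed here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the per-word vectors of each sentence: 7 words each,
# 100 dimensions, matching size=100 from the training example above.
wordvecs_obama = rng.normal(size=(7, 100))
wordvecs_president = rng.normal(size=(7, 100))

# Average the word-vectors into one vector per sentence.
meanvec_obama = wordvecs_obama.mean(axis=0)
meanvec_president = wordvecs_president.mean(axis=0)

# Cosine similarity between the two averaged sentence vectors,
# always a value in [-1, 1].
cosine = np.dot(meanvec_obama, meanvec_president) / (
    np.linalg.norm(meanvec_obama) * np.linalg.norm(meanvec_president)
)
print(cosine)
```

With vectors from a real trained model, the two example sentences would typically score higher than unrelated sentence pairs, since they share related vocabulary.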



Source: https://stackoverflow.com/questions/57079642/after-training-word-embedding-with-gensims-fasttexts-wrapper-how-to-embed-new
