Question
Using the gensim.models.Word2Vec library, you can load a model and provide a "word" for which you want to find the list of most similar words:
model = gensim.models.Word2Vec.load_word2vec_format(model_file, binary=True)
model.most_similar(positive=[WORD], topn=N)
I wonder if it is possible to give the system the model and a "vector" as input, and ask it to return the top similar words, i.e. the words whose vectors are closest to the given vector. Something like:
model.most_similar(positive=[VECTOR], topn=N)
I need this functionality for a bilingual setting, in which I have two models (English and German), as well as some English words for which I need to find the most similar German candidates. What I want to do is get the vector of each English word from the English model:
model_EN = gensim.models.Word2Vec.load_word2vec_format(model_file_EN, binary=True)
vector_w_en = model_EN[WORD_EN]
and then query the German model with these vectors.
model_DE = gensim.models.Word2Vec.load_word2vec_format(model_file_DE, binary=True)
model_DE.most_similar(positive=[vector_w_en], topn=N)
I have implemented this in C using the original distance function from the word2vec package. But now I need it in Python, so that I can integrate it with my other scripts.
Do you know if there is already a method in the gensim.models.Word2Vec library, or in another similar library, that does this? Or do I need to implement it myself?
Answer 1:
The method similar_by_vector returns the top-N most similar words by vector:
similar_by_vector(vector, topn=10, restrict_vocab=None)
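For the bilingual setting from the question, the lookup then goes roughly as follows. This is a minimal sketch that reuses the loading calls and placeholder names (model_file_EN, model_file_DE, WORD_EN, N) from the question; in recent gensim versions similar_by_vector is a KeyedVectors method, e.g. model_DE.wv.similar_by_vector.
import gensim
# Load both models as in the question (the file names are placeholders).
model_EN = gensim.models.Word2Vec.load_word2vec_format(model_file_EN, binary=True)
model_DE = gensim.models.Word2Vec.load_word2vec_format(model_file_DE, binary=True)
# Look up the English word's vector, then find its nearest German neighbours by vector.
vector_w_en = model_EN[WORD_EN]
candidates = model_DE.similar_by_vector(vector_w_en, topn=N)  # list of (word, cosine similarity)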
Answer 2:
I don't think what you're trying to achieve could ever give an accurate answer, simply because the two models are trained separately. Although both the English and the German model will have similar distances between their respective word vectors, there's no guarantee that the word vector for 'House' will have the same direction as the word vector for 'Haus'.
In simple terms: if you trained both models with vector size 3 and 'House' has the vector [0.5, 0.2, 0.9], there's no guarantee that 'Haus' will have the vector [0.5, 0.2, 0.9], or even something close to it.
In order to solve this, you could first translate the English word to German and then use the vector for that word to look for similar words in the German model.
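A minimal sketch of that suggestion, assuming a hypothetical EN-to-DE translation lookup: en_to_de is not part of gensim and stands in for whatever dictionary or translation step you have available, and model_file_DE is a placeholder as in the question.
import gensim
model_DE = gensim.models.Word2Vec.load_word2vec_format(model_file_DE, binary=True)
en_to_de = {"house": "Haus"}  # hypothetical EN -> DE translation lookup
def similar_german_candidates(word_en, topn=10):
    word_de = en_to_de[word_en]  # translate the English word first
    return model_DE.most_similar(positive=[word_de], topn=topn)
print(similar_german_candidates("house"))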
TL;DR: You can't just plug vectors from one language's model into another and expect accurate results.
Source: https://stackoverflow.com/questions/37818426/get-most-similar-words-given-the-vector-of-the-word-not-the-word-itself