Get most similar words, given the vector of the word (not the word itself)

帅比萌擦擦* 提交于 2019-12-20 09:55:12

问题


Using the gensim.models.Word2Vec library, you have the possibility to provide a model and a "word" for which you want to find the list of most similar words:

model = gensim.models.Word2Vec.load_word2vec_format(model_file, binary=True)
model.most_similar(positive=[WORD], topn=N)

I wonder if there is a possibility to give the system as input the model and a "vector", and ask the system to return the top similar words (which their vectors is very close to the given vector). Something similar to:

model.most_similar(positive=[VECTOR], topn=N)

I need this functionality for a bilingual setting, in which I have 2 models (English and German), as well as some English words for which I need to find their most similar German candidates. What I want to do is to get the vector of each English word from the English model:

model_EN = gensim.models.Word2Vec.load_word2vec_format(model_file_EN, binary=True)
vector_w_en=model_EN[WORD_EN]

and then query the German model with these vectors.

model_DE = gensim.models.Word2Vec.load_word2vec_format(model_file_DE, binary=True)
model_DE.most_similar(positive=[vector_w_en], topn=N)

I have implemented this in C using the original distance function in the word2vec package. But, now I need it to be in python, in order to be able to integrate it with my other scripts.

Do you know if there is already a method in gensim.models.Word2Vec library or other similar libraries which does this? Do I need to implement it by myself?


回答1:


The method similar_by_vector returns the top-N most similar words by vector:

similar_by_vector(vector, topn=10, restrict_vocab=None)



回答2:


I don't think what you're trying to achieve could ever give an accurate answer. Simply because the two models are trained separately. And although both the English and the German model will have similar distances between their respective word vectors. There's no guarantee that the word vector for 'House' will have the same direction as the word vector for 'Haus'.

In simple terms, if you trained both models with vector size=3. And 'House' has vector [0.5,0.2,0.9], there's no guarantee that 'Haus' will have vector [0.5,0.2,0.9] or even something close to that.

In order to solve this, you could first translate the English word to German and then use the vector for that word to look for similar words in the German model.

TL:DR; You can't just plug in vectors from one language model into another and expect to have accurate results.



来源:https://stackoverflow.com/questions/37818426/get-most-similar-words-given-the-vector-of-the-word-not-the-word-itself

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!