Why is the similarity between two bags-of-words in gensim.word2vec calculated this way?


Question


from numpy import array, dot
from gensim import matutils

def n_similarity(self, ws1, ws2):
    v1 = [self[word] for word in ws1]  # vectors for the first bag of words
    v2 = [self[word] for word in ws2]  # vectors for the second bag of words
    # cosine similarity of the two mean vectors (dot product of unit vectors)
    return dot(matutils.unitvec(array(v1).mean(axis=0)), matutils.unitvec(array(v2).mean(axis=0)))

This is code I excerpted from gensim.word2vec. I know that the similarity between two single words can be calculated as the cosine of their vectors, but what about two word sets? The code seems to take the mean of the word vectors in each set and then compute the cosine similarity between the two mean vectors. I know little about word2vec; is there any theoretical foundation for this procedure?
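For context, here is a minimal sketch of how this method is called on a trained model. The toy corpus is made up for illustration, and in gensim 4.x the method lives on model.wv:

from gensim.models import Word2Vec

# Toy corpus just for illustration; a real model needs far more data.
model = Word2Vec([["cat", "sat"], ["dog", "ran"]], vector_size=20, min_count=1)

# Similarity between two bags of words, computed exactly as in the excerpt above.
print(model.wv.n_similarity(["cat", "sat"], ["dog", "ran"]))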


Answer 1:


Taking the mean of all the word vectors is the simplest way of reducing them to a single vector, so that cosine similarity can be applied. The intuition is that by adding up all the word vectors you get a bit of each of them (of their meaning) in the result. You then divide by the number of vectors so that larger bags of words don't end up with longer vectors (not that it matters for cosine similarity anyway).
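Here is a minimal self-contained sketch of that idea, using made-up toy vectors (real word2vec vectors are learned from a corpus and much longer):

import numpy as np

# Toy 3-dimensional "word vectors" invented for the example.
vectors = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.1, 0.0, 0.9]),
}

def bag_vector(words):
    # Mean of the word vectors, scaled to unit length.
    mean = np.mean([vectors[w] for w in words], axis=0)
    return mean / np.linalg.norm(mean)

# Cosine similarity of two unit vectors is just their dot product.
print(np.dot(bag_vector(["cat", "dog"]), bag_vector(["car"])))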

Reducing an entire sentence to a single vector is a complex problem, and there are other ways to do it. I wrote a bit about it in a related question on SO. Since then a number of new algorithms have been proposed. One of the more accessible ones is Paragraph Vector, which you shouldn't have problems understanding if you are familiar with word2vec; see the sketch below.
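gensim ships an implementation of Paragraph Vector as Doc2Vec. A minimal sketch with a made-up corpus and arbitrary parameter values (the attribute holding document vectors is model.dv in gensim 4.x, model.docvecs in earlier versions):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus; each document needs a unique tag.
corpus = [
    TaggedDocument(words=["the", "cat", "sat"], tags=["d0"]),
    TaggedDocument(words=["the", "dog", "ran"], tags=["d1"]),
]

model = Doc2Vec(corpus, vector_size=20, min_count=1, epochs=40)

# Infer a vector for an unseen bag of words and find the closest document.
vec = model.infer_vector(["a", "cat", "sat"])
print(model.dv.most_similar([vec]))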



Source: https://stackoverflow.com/questions/26010645/why-the-similarity-beteween-two-bag-of-words-in-gensim-word2vec-calculated-this
