Cosine similarity of word2vec more than 1

Posted by 大憨熊 on 2021-02-07 14:49:26

Question


I used Spark's word2vec algorithm to compute document vectors for a text.

I then used the findSynonyms function of the model object to get synonyms for a few words.

I see something like this:

w2vmodel.findSynonyms('science',4).show(5)
+------------+------------------+
|        word|        similarity|
+------------+------------------+
|     physics| 1.714908638833209|
|     fiction|1.5189824643358183|
|neuroscience|1.4968051528391833|
|  psychology| 1.458865636374223|
+------------+------------------+

I do not understand why the cosine similarity comes out greater than 1. Cosine similarity should lie between -1 and +1 (or between 0 and 1 if the angle between the vectors is at most 90 degrees).

Why is it more than 1 here? What is going wrong?


Answer 1:


You should normalize the word vectors that you get from word2vec. Otherwise the dot product of two vectors is unbounded and cannot be read as a cosine similarity.

From Levy et al., 2015 (and, actually, most of the literature on word embeddings):

Vectors are normalized to unit length before they are used for similarity calculation, making cosine similarity and dot-product equivalent.
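
The equivalence is easy to check numerically. The following is a minimal sketch in plain Python, using made-up 3-dimensional vectors in place of real word2vec embeddings; the helper names dot, norm, unit and cosine are illustrative and do not come from any library.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(dot(u, u))

def unit(u):
    """Scale u to unit length; a zero vector is returned unchanged."""
    n = norm(u)
    return list(u) if n == 0 else [a / n for a in u]

def cosine(u, v):
    """Cosine similarity: dot product divided by both norms."""
    return dot(u, v) / (norm(u) * norm(v))

# Toy vectors (assumed values, not real embeddings).
science = [3.0, 4.0, 0.0]
physics = [6.0, 8.0, 1.0]

raw_dot = dot(science, physics)                # unbounded, depends on vector lengths
cos = cosine(science, physics)                 # always within [-1, 1]
unit_dot = dot(unit(science), unit(physics))   # dot product after normalization

print(raw_dot, cos, unit_dot)
```

With these toy values the raw dot product is far above 1, while the dot product of the normalized vectors matches the cosine similarity up to floating-point error.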

How to do normalization?

You can do something like below.

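import numpy as np

def normalize(word_vec):
    """Scale word_vec to unit length; a zero vector is returned unchanged."""
    norm = np.linalg.norm(word_vec)
    if norm == 0:
        # Avoid division by zero for an all-zero vector.
        return word_vec
    return word_vec / norm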

References

  • Should I do normalization to word embeddings from word2vec if I want to do semantic tasks?
  • Should I normalize word2vec's word vectors before using them?

Update: Why is the cosine similarity from word2vec greater than 1?

According to this answer, in the Spark implementation of word2vec, findSynonyms doesn't actually return cosine similarities, but rather cosine similarities multiplied by the norm of the query vector.

The ordering and relative values are consistent with the true cosine similarity, but the actual values are all scaled by that norm, which is why values above 1 can appear.
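
To make the scaling concrete, here is a small sketch in plain Python. It assumes the behavior described in the linked answer (score = cosine similarity times the norm of the query vector). The functions scaled_score and true_cosine and the toy vectors are hypothetical and only mimic that description; they are not Spark's code.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(dot(u, u))

def scaled_score(query, candidate):
    """Assumed findSynonyms-style score: only the candidate is normalized,
    so the result equals cosine(query, candidate) * norm(query)."""
    return dot(query, candidate) / norm(candidate)

def true_cosine(query, candidate):
    """Recover the real cosine by dividing out the query norm."""
    return scaled_score(query, candidate) / norm(query)

# Toy vectors (assumed values, not real embeddings).
query = [3.0, 4.0, 0.0]
candidates = {
    "physics": [6.0, 8.0, 1.0],
    "fiction": [1.0, 0.0, 2.0],
}

scores = {w: scaled_score(query, v) for w, v in candidates.items()}
cosines = {w: true_cosine(query, v) for w, v in candidates.items()}

for word in sorted(scores, key=scores.get, reverse=True):
    print(word, scores[word], cosines[word])
```

Both scores exceed 1 here, yet the ranking is the same as with the true cosine. Under the same assumption, dividing the similarity column by the norm of the query word's vector would bring Spark's values back into the [-1, 1] range.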



Source: https://stackoverflow.com/questions/41387000/cosine-similarity-of-word2vec-more-than-1
