Word2vec is an open-source tool from Google for computing distances between words. Given an input word, it outputs a list of words ranked by their similarity to it.
As you know, word2vec can represent a word as a mathematical vector. So once you have trained the model, you can obtain the vectors of the words spain and france and compute the cosine similarity between them (which is just the dot product once the vectors are normalized to unit length).
An easy way to do this is to use this Python wrapper for word2vec. You can obtain a word's raw vector like this:
>>> model['computer'] # raw numpy vector of a word
array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)
To compute the cosine similarity between two words, you can do the following:
>>> import numpy
>>> spain, france = model['spain'], model['france']
>>> cosine_similarity = numpy.dot(spain, france) / (numpy.linalg.norm(spain) * numpy.linalg.norm(france))
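If you do this for many word pairs, it may help to wrap the computation in a small helper. Here is a minimal sketch (the function name is my own; it assumes `numpy` is imported as above and that `model[word]` lookups work as shown):
>>> def cosine_similarity(model, word1, word2):
...     # Look up the raw vectors, then normalize by their lengths
...     # so the result lies in [-1, 1].
...     v1, v2 = model[word1], model[word2]
...     return numpy.dot(v1, v2) / (numpy.linalg.norm(v1) * numpy.linalg.norm(v2))
>>> cosine_similarity(model, 'spain', 'france')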
I have developed code to help with calculating the cosine similarity of two sentences / SKUs using gensim. The code can be found here: https://github.com/aviralmathur/Word2Vec
The code uses data from the Kaggle CrowdFlower competition.
It was developed using the code from the Kaggle tutorial on Word2Vec, available here: https://www.kaggle.com/c/word2vec-nlp-tutorial
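For illustration, here is a minimal sketch of the general idea (my own simplification, not necessarily what the repository does): represent each sentence by the average of its word vectors, then take the cosine similarity of the two averages. It assumes a model object that supports `word in model` and `model[word]`, as gensim's word vectors do.

import numpy

def avg_vector(model, sentence):
    # Average the vectors of the words the model knows about
    # (no handling here for sentences with no known words).
    words = [w for w in sentence.lower().split() if w in model]
    return numpy.mean([model[w] for w in words], axis=0)

def sentence_similarity(model, s1, s2):
    # Cosine similarity of the two averaged sentence vectors.
    v1, v2 = avg_vector(model, s1), avg_vector(model, s2)
    return numpy.dot(v1, v2) / (numpy.linalg.norm(v1) * numpy.linalg.norm(v2))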
I hope this helps
gensim has a Python implementation of Word2Vec which provides a built-in utility for finding the similarity between two words given as input by the user. The syntax for this in Python is as follows:
>>> from gensim.models import Word2Vec
>>> model = Word2Vec.load('path/to/your/model')
>>> model.similarity('france', 'spain')
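If you are on gensim 4.x or later, the word-vector methods live on the model's keyed vectors, so the last call becomes:
>>> model.wv.similarity('france', 'spain')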
I just stumbled on this question while looking for how to do this by modifying the original distance.c, not by using another library like gensim.
I didn't find an answer so I did some research, and am sharing it here for others who also want to know how to do it in the original implementation.
After looking through the C source, you will find that 'bi' is an array of indexes. If you provide two words, the index for word1 will be in bi[0] and the index of word2 will be in bi[1].
The model 'M' is an array of vectors. Each word is represented as a vector with dimension 'size'.
Using these two indexes and the array of vectors, look the words up and compute the cosine similarity (which here is just a dot product, because distance.c normalizes every vector to unit length when it loads the model) like this:
dist = 0;
/* Dot product of row bi[0] and row bi[1] of M (each row has 'size' floats).
   Since the vectors are unit length, this is the cosine similarity. */
for (a = 0; a < size; a++) {
  dist += M[a + bi[0] * size] * M[a + bi[1] * size];
}
After the loop completes, the value 'dist' is the cosine similarity between the two words.