How to measure the similarity of two documents, given the similarity of each pair of words?

Question

I have two documents, for example:

Doc1 = {'python','numpy','machine learning'}
Doc2 = {'python','pandas','tensorflow','svm','regression','R'}

I also know the similarity (correlation) of each pair of words, e.g.:

Sim('python','python') = 1
Sim('python','pandas') = 0.8
Sim('numpy', 'R') = 0.1

What is the best way to measure the similarity of the two documents?

It seems that the traditional Jaccard distance and cosine distance are not good metrics in this situation, since they only count exact word matches and ignore the known similarity between different words.
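To make the concern concrete, here is a small sketch that computes the plain Jaccard similarity of the two example documents. The helper name `jaccard` is only illustrative. Only the exact match on 'python' contributes, so the known score Sim('python','pandas') = 0.8 has no effect on the result.

```python
# Plain Jaccard similarity: size of the intersection over size of the union.
# It sees only exact matches, so pairwise word similarities are ignored.
doc1 = {'python', 'numpy', 'machine learning'}
doc2 = {'python', 'pandas', 'tensorflow', 'svm', 'regression', 'R'}


def jaccard(a: set, b: set) -> float:
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets are identical.
        return 1.0
    return len(a & b) / len(union)


print(jaccard(doc1, doc2))  # 1 shared word out of 8 distinct words -> 0.125
```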

Answer 1:

I like Peter Christen's book on this topic.

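In it he describes the Monge-Elkan similarity measure between two sets of strings. For each word in the first set, you find the most similar word in the second set and take that similarity; you then sum these maximum similarities and divide by the number of elements in the first set. Note that the measure is asymmetric: swapping the two sets generally gives a different value. You can see its description on page 30 here.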
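The procedure described above can be sketched in Python as follows. This is a minimal illustration, not code from the book: the names `monge_elkan`, `symmetric_monge_elkan`, `PAIR_SIM` and `sim` are hypothetical, the lookup table holds only the scores given in the question, and treating every unlisted pair as 0.0 is an assumption made purely so the example runs.

```python
from typing import Callable, Iterable

# Pairwise scores taken from the question; keys are unordered pairs.
PAIR_SIM = {
    frozenset(('python', 'pandas')): 0.8,
    frozenset(('numpy', 'R')): 0.1,
}


def sim(x: str, y: str) -> float:
    """Word similarity lookup. Identical words score 1.0.
    ASSUMPTION: pairs missing from PAIR_SIM score 0.0."""
    if x == y:
        return 1.0
    return PAIR_SIM.get(frozenset((x, y)), 0.0)


def monge_elkan(doc_a: Iterable[str], doc_b: Iterable[str],
                word_sim: Callable[[str, str], float]) -> float:
    """Average, over the words of doc_a, of the best match found in doc_b."""
    a, b = list(doc_a), list(doc_b)
    if not a or not b:
        return 0.0  # convention chosen for this sketch
    return sum(max(word_sim(x, y) for y in b) for x in a) / len(a)


def symmetric_monge_elkan(doc_a, doc_b, word_sim) -> float:
    """One common way to remove the asymmetry: average both directions."""
    return 0.5 * (monge_elkan(doc_a, doc_b, word_sim)
                  + monge_elkan(doc_b, doc_a, word_sim))


doc1 = {'python', 'numpy', 'machine learning'}
doc2 = {'python', 'pandas', 'tensorflow', 'svm', 'regression', 'R'}

print(monge_elkan(doc1, doc2, sim))            # (1.0 + 0.1 + 0.0) / 3
print(monge_elkan(doc2, doc1, sim))            # (1.0 + 0.8 + 0.1) / 6
print(symmetric_monge_elkan(doc1, doc2, sim))
```

With a complete similarity table in place of `PAIR_SIM`, the same functions apply unchanged. Whether to use the one-directional score or the symmetric average depends on whether the comparison should treat one document as the query.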

Source: https://stackoverflow.com/questions/52090786/how-to-measure-the-similarity-of-two-documents-given-the-similarity-of-each-pa

Tags

python-3.x

nlp

similarity
