Question
I have two documents, for example:
Doc1 = {'python','numpy','machine learning'}
Doc2 = {'python','pandas','tensorflow','svm','regression','R'}
I also know the similarity (correlation) of each pair of words, e.g.:
Sim('python','python') = 1
Sim('python','pandas') = 0.8
Sim('numpy', 'R') = 0.1
What is the best way to measure the similarity of the two documents?
It seems that the traditional Jaccard distance and cosine distance are not good metrics in this situation, since they do not make use of the pairwise word similarities.
Answer 1:
I like a book by Peter Christen on this issue.
In it he describes the Monge-Elkan similarity measure between two sets of strings. For each word in the first set, you find the most similar word in the second set; you then sum these best-match similarities and divide by the number of elements in the first set. You can see its description on page 30 here.
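To make that concrete, here is a minimal Python sketch of the measure as described above. It is an illustration, not the book's code: the helper names `make_sim`, `monge_elkan`, and `monge_elkan_symmetric` are made up for this example, and it assumes that any word pair without a known score has similarity 0 and that identical words have similarity 1. Only the three scores given in the question are used.

```python
def make_sim(pair_scores, default=0.0):
    """Build an order-independent lookup from {(word_a, word_b): score}.

    Assumption: unknown pairs get `default`, identical words score 1.0.
    """
    table = {frozenset(pair): score for pair, score in pair_scores.items()}

    def sim(a, b):
        if a == b:
            return 1.0
        return table.get(frozenset((a, b)), default)

    return sim


def monge_elkan(doc_a, doc_b, sim):
    """Average, over words in doc_a, of the best match found in doc_b."""
    if not doc_a or not doc_b:
        return 0.0
    return sum(max(sim(a, b) for b in doc_b) for a in doc_a) / len(doc_a)


def monge_elkan_symmetric(doc_a, doc_b, sim):
    """One possible way to symmetrize: average the two directions."""
    return (monge_elkan(doc_a, doc_b, sim) + monge_elkan(doc_b, doc_a, sim)) / 2


doc1 = {'python', 'numpy', 'machine learning'}
doc2 = {'python', 'pandas', 'tensorflow', 'svm', 'regression', 'R'}

# Only the scores stated in the question; everything else defaults to 0.
sim = make_sim({
    ('python', 'pandas'): 0.8,
    ('numpy', 'R'): 0.1,
})

print(round(monge_elkan(doc1, doc2, sim), 3))            # 0.367
print(round(monge_elkan(doc2, doc1, sim), 3))            # 0.317
print(round(monge_elkan_symmetric(doc1, doc2, sim), 3))  # 0.342
```

Note that the measure is not symmetric: the score from Doc1 to Doc2 differs from the score from Doc2 to Doc1, so averaging the two directions (as in `monge_elkan_symmetric`) is one option if you need a single number.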
Source: https://stackoverflow.com/questions/52090786/how-to-measure-the-similarity-of-two-documents-given-the-similarity-of-each-pa