Measuring similarity between document sets

Submitted by 回眸只為那壹抹淺笑 on 2019-12-10 12:56:21

Question


For illustration purposes, let's assume this is a forum service. I need to calculate the "similarity" among each user's posts, so that the result would be something like:

among posts by user A, similarity 60%
among posts by user B, similarity 20%
...

I'm dealing with multibyte strings, so I guess I'm stuck with search engines here. We already use Solr and already have moreLikeThis implemented, but I'm not quite sure how to construct the query. Any help appreciated!


Answer 1:


Carrot2 may interest you (and this blog related to it).




Answer 2:


This is a strange question in two ways:

1. Why do you have to deal with Solr?
2. The kind of similarity depends on the target problem.

Your question sounds too generic to me. There is research going on in the area of semantic similarity. There is also the edit-distance algorithm, which is probably not what you want.

So, define your question more precisely and you will get better answers.
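
For reference, here is a minimal sketch of the edit-distance (Levenshtein) idea mentioned above, using only the Python standard library. The function names `edit_distance` and `normalized_similarity` are made up for illustration, and dividing by the longer string's length is just one possible way to turn a distance into a score.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions to turn s into t."""
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, 1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_similarity(s, t):
    """Hypothetical helper: map the distance onto a 0..1 score
    (an assumption, not a standard definition)."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s, t) / longest
```

Python strings are sequences of Unicode code points, so multibyte text is compared character by character rather than byte by byte. The measure is purely about surface spelling, which is why it is probably a poor fit for comparing whole posts.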




Answer 3:


There are several measures of similarity; a simple and effective one is cosine similarity. There are also more sophisticated ones, such as Smith-Waterman.

Look at http://sourceforge.net/projects/simmetrics/
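
To make the cosine idea concrete for the "similarity among one user's posts" case, here is a minimal sketch in plain Python. It rests on several assumptions that are not in the question: posts are represented as counts of character bigrams (a rough way to handle multibyte text without a tokenizer), and a user's score is taken to be the average over all pairs of that user's posts. The names `char_ngrams`, `cosine`, and `average_similarity` are hypothetical; this does not use SimMetrics or Solr.

```python
import math
from collections import Counter
from itertools import combinations


def char_ngrams(text, n=2):
    """Count overlapping character n-grams. Python strings are sequences
    of code points, so multibyte characters are handled as single units."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def average_similarity(posts, n=2):
    """Assumed definition: mean pairwise cosine similarity among posts.
    Returns 0.0 when there are fewer than two posts."""
    vectors = [char_ngrams(p, n) for p in posts]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Example usage with made-up data: one list of posts per user.
posts_by_user = {
    "A": ["今日は天気がいい", "今日は天気が悪い", "明日は天気がいい"],
    "B": ["apples and oranges", "quantum field theory"],
}
for user, posts in posts_by_user.items():
    print(f"among posts by user {user}, "
          f"similarity {average_similarity(posts):.0%}")
```

The pairwise loop is quadratic in the number of posts per user, so for large volumes a weighting scheme such as TF-IDF and an index-backed approach would likely be needed; this sketch only shows the shape of the calculation.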



Source: https://stackoverflow.com/questions/6069922/measuring-similarity-between-document-sets
