Algorithm to find articles with similar text

梦谈多话 2020-11-28 18:10

I have many articles in a database (with title and text). I'm looking for an algorithm to find the X most similar articles, something like Stack Overflow's "Related Questions".

15 Answers
  • 2020-11-28 18:47

    If you are looking for words that sound alike, you could convert the words to Soundex codes and then match on those codes ... worked for me
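
    As a rough illustration of the Soundex idea above, here is a minimal sketch of the common American Soundex rules, plus a hypothetical helper that scores two titles by how many Soundex codes they share. The names `soundex` and `phonetic_overlap` and the use of Jaccard overlap as the score are assumptions for this example, not something the answer specified.

```python
import re


def soundex(word: str) -> str:
    """Return the 4-character American Soundex code for a word."""
    groups = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
              ("l", "4"), ("mn", "5"), ("r", "6"))
    codes = {ch: digit for letters, digit in groups for ch in letters}

    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""

    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # vowels (and unmapped letters) reset prev to ""
    return (result + "000")[:4]  # pad or truncate to letter + 3 digits


def phonetic_overlap(a: str, b: str) -> float:
    """Hypothetical score: Jaccard overlap of the Soundex codes in two texts."""
    codes_a = {soundex(w) for w in re.findall(r"[A-Za-z]+", a)}
    codes_b = {soundex(w) for w in re.findall(r"[A-Za-z]+", b)}
    if not codes_a or not codes_b:
        return 0.0
    return len(codes_a & codes_b) / len(codes_a | codes_b)
```

    With this sketch, "Robert" and "Rupert" both map to R163, so titles that differ only in spelling can still match. Soundex was designed for English names, so it is likely a better fit for titles than for long article bodies.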

  • 2020-11-28 18:48

    I tried several methods, but none worked well. You may get a relatively satisfactory result like this. First: compute a Google SimHash code for every paragraph of every text and store it in the database. Second: build an index on the SimHash codes. Third: process the text to be compared in the same way to get its SimHash code, then use the index to find all stored texts whose codes are within a Hamming distance of about 5-10. Finally, compare the similarity of those candidates using term vectors. This may work for big data.
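
    The steps above can be sketched roughly as follows. This is a minimal illustration under several assumptions: tokens are hashed with MD5 truncated to 64 bits, the index lookup is replaced by a plain linear scan, and the final term-vector comparison is cosine similarity over raw word counts. The function names and the `max_distance` default are hypothetical.

```python
import hashlib
import math
import re
from collections import Counter


def tokens(text: str) -> list:
    return re.findall(r"\w+", text.lower())


def simhash(text: str, bits: int = 64) -> int:
    """Weighted SimHash fingerprint; bits must be a multiple of 8, at most 128."""
    totals = [0] * bits
    for token, weight in Counter(tokens(text)).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +weight or -weight on every bit position.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def cosine_similarity(a: str, b: str) -> float:
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def find_similar(query: str, paragraphs: list, max_distance: int = 10) -> list:
    """Filter by SimHash Hamming distance, then rank by term-vector cosine."""
    query_hash = simhash(query)
    # Assumption: a linear scan stands in for the database index here.
    candidates = [p for p in paragraphs
                  if hamming_distance(query_hash, simhash(p)) <= max_distance]
    return sorted(candidates,
                  key=lambda p: cosine_similarity(query, p), reverse=True)
```

    In a real system the fingerprints would be computed once and stored, and the Hamming-distance filter would need an index that supports near-match lookups rather than a scan over every row.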

  • 2020-11-28 18:52

    Seconding the Lucene suggestion for full-text search, but note that Java is not a requirement; a .NET port is available. Also see the main Lucene page for links to other projects, including Lucy, a C port.
