I have many articles in a database (each with a title and text). I'm looking for an algorithm to find the X most similar articles, something like Stack Overflow's "Related Questions".
If you are looking for words that sound alike, you could convert them to Soundex codes and match on the Soundex codes instead of the raw words. That worked for me.
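For anyone who wants to try the Soundex idea, here is a minimal sketch of the classic American Soundex rules in Python. The function names `soundex` and `sounds_alike` are my own, not from any library, and non-ASCII letters are simply dropped, which is an assumption that may not suit every data set.

```python
def soundex(word: str) -> str:
    """Return the 4-character American Soundex code for a word ('' if no letters)."""
    # Map each consonant to its Soundex digit; vowels, H, W and Y have no digit.
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    # Keep ASCII letters only (assumption: other characters are ignored).
    word = "".join(ch for ch in word.upper() if ch.isascii() and ch.isalpha())
    if not word:
        return ""

    result = word[0]
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":
            continue  # H and W do not separate consonants with the same code
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code after it counts again
    return (result + "000")[:4]


def sounds_alike(a: str, b: str) -> bool:
    """True when both words are non-empty and share a Soundex code."""
    code = soundex(a)
    return code != "" and code == soundex(b)
```

You could store `soundex(word)` for each title word in an indexed column and match on that column. Keep in mind that Soundex was designed for English surnames, so it is a rough filter for sound-alike words and not a measure of how similar two articles are.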
I tried several methods, but none worked well. You can get a relatively satisfactory result like this: First, compute a Google SimHash code for every paragraph of every text and store it in the database. Second, build an index on the SimHash codes. Third, process the text to be compared in the same way to get its SimHash code, then use the index to find all texts whose codes are within a Hamming distance of about 5-10. Finally, compare the similarity of those candidates using term vectors. This may work for big data.
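The steps above can be sketched roughly as follows. This is only an illustration under several assumptions of mine: a 64-bit fingerprint built from MD5 token hashes (not necessarily the hash Google uses), a plain regex tokenizer, raw term counts as the term vector, and a linear scan standing in for a real SimHash index. All function names are hypothetical.

```python
import hashlib
import math
import re
from collections import Counter

BITS = 64  # fingerprint width (assumption)


def tokens(text):
    """Very simple tokenizer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def simhash(text):
    """64-bit SimHash: each token votes +/- its count on every bit position."""
    weights = [0] * BITS
    for token, count in Counter(tokens(text)).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the MD5 digest
        for i in range(BITS):
            weights[i] += count if (h >> i) & 1 else -count
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def cosine(a, b):
    """Cosine similarity between the term-count vectors of two texts."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def most_similar(query, articles, x=5, max_distance=10):
    """Filter by Hamming distance on SimHash, then rank survivors by cosine."""
    qh = simhash(query)
    # Linear scan for clarity; a real system would look these up in an index
    # and would store precomputed fingerprints instead of hashing on each call.
    candidates = [a for a in articles if hamming(qh, simhash(a)) <= max_distance]
    return sorted(candidates, key=lambda a: cosine(query, a), reverse=True)[:x]
```

The SimHash step is only a cheap candidate filter; the term-vector comparison does the actual ranking. Very short texts give noisy fingerprints, so `max_distance` probably needs tuning against your own data.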
Seconding the Lucene suggestion for full-text search, but note that Java is not a requirement; a .NET port is available. Also see the main Lucene page for links to other projects, including Lucy, a C port.