How do Google News and Techmeme cluster similar news items? Is there a well-known algorithm used to achieve this?
Appreciate your help.
Thanks
There are a few different ways to do it. The standard approach is a "bag of words" representation (weighted with TF-IDF), with cosine similarity as the measure of how close two articles are and k-means to group them.
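To make that pipeline concrete, here is a minimal pure-Python sketch. It is only an illustration under my own assumptions, not what Google News or Techmeme actually run: tokenization is a plain whitespace split, the IDF formula is one common smoothed variant, the k-means step is the cosine-based ("spherical") variant, and the centroids are seeded deterministically with a farthest-first pick rather than randomly. The function names (`tfidf_vectors`, `cosine`, `kmeans`) and the sample headlines are made up for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn each document into a unit-length TF-IDF vector (a dict term -> weight)."""
    tokenised = [doc.lower().split() for doc in docs]
    n = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption; other weightings are equally common).
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (just their dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans(vectors, k, iterations=20):
    """Cosine-based k-means; returns one cluster label per vector."""
    # Farthest-first seeding: start with the first vector, then repeatedly add
    # the vector that is least similar to the centroids chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(min(vectors, key=lambda v: max(cosine(v, c) for c in centroids)))
    labels = None
    for _ in range(iterations):
        # Assignment step: each vector goes to its most similar centroid.
        new_labels = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the normalised sum of its members.
        for c in range(k):
            total = Counter()
            for v, label in zip(vectors, labels):
                if label == c:
                    for t, w in v.items():
                        total[t] += w
            norm = math.sqrt(sum(w * w for w in total.values()))
            if norm:
                centroids[c] = {t: w / norm for t, w in total.items()}
    return labels


headlines = [
    "apple unveils new iphone model",
    "apple iphone launch draws crowds",
    "senate passes budget bill",
    "budget bill heads to senate vote",
]
print(kmeans(tfidf_vectors(headlines), k=2))
```

Note that `k` has to be chosen up front and every headline has to be available before clustering starts, which is the main limitation for a live news feed.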
I've had success with this paper: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=4289851
The great thing about it is: 1) It's incremental, which is great for news. With standard k-means, you need the entire data set up front. With news, articles usually arrive over time, and an incremental algorithm can place each new article as it comes in instead of re-clustering everything. 2) It's phrase-based, so it relies on phrases rather than just individual words.
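To show what "incremental" means in practice, here is a small sketch of single-pass threshold clustering. To be clear, this is not the algorithm from the paper: it is a much simpler stand-in that uses single words instead of phrases and raw term counts instead of TF-IDF. The class name `IncrementalClusterer` and the `threshold=0.3` default are my own assumptions and would need tuning on real data.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class IncrementalClusterer:
    """Assigns each arriving article to an existing cluster or opens a new one."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # minimum similarity needed to join a cluster
        self.centroids = []         # summed term counts, one Counter per cluster
        self.members = []           # article ids, one list per cluster

    def add(self, article_id, text):
        """Place one article and return the index of its cluster."""
        vec = Counter(text.lower().split())
        best, best_sim = None, self.threshold
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            # Nothing is similar enough, so this article starts a new cluster.
            self.centroids.append(Counter(vec))
            self.members.append([article_id])
            return len(self.centroids) - 1
        # Fold the article into the best cluster's running centroid.
        self.centroids[best].update(vec)
        self.members[best].append(article_id)
        return best


clusterer = IncrementalClusterer()
stream = [
    "apple unveils new iphone model",
    "senate passes budget bill",
    "apple iphone launch draws crowds",
    "budget bill heads to senate vote",
]
for article_id, headline in enumerate(stream):
    print(article_id, clusterer.add(article_id, headline))
```

The trade-off is that the number of clusters is no longer fixed in advance, but the result depends on arrival order and on the threshold.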
Recently, there have been techniques that compare articles by the concepts they mention instead of by their words (for instance, by extracting Wikipedia or DBpedia concepts from each article and using those as the features instead of just words).
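A toy sketch of the concept-based idea follows. Everything in it is a placeholder: the `CONCEPTS` table is a hand-written, hypothetical stand-in for a real entity linker that would map article text to Wikipedia or DBpedia entries, and Jaccard overlap is just one simple way to compare the resulting concept sets.

```python
# Hypothetical lookup table standing in for a real entity-linking step.
# The values are illustrative labels written in the style of Wikipedia titles.
CONCEPTS = {
    "uk": "United_Kingdom",
    "britain": "United_Kingdom",
    "eu": "European_Union",
    "senate": "United_States_Senate",
}


def concepts(text):
    """Return the set of concept labels whose aliases appear in the text."""
    return {CONCEPTS[word] for word in text.lower().split() if word in CONCEPTS}


def jaccard(a, b):
    """Overlap between two concept sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


first = concepts("UK and EU reach trade deal")
second = concepts("Britain signs agreement with EU")
print(jaccard(first, second))
```

The two headlines share only one word ("EU"), yet they map to the same pair of concepts, which is the kind of match a purely word-based comparison tends to miss.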