How does Google find relevant content when it\'s parsing the web?
Let\'s say, for instance, Google uses the PHP native DOM Library to parse content. What methods would t
I don't work at Google but around a year ago I read they had over 200 factors for ranking their search results. Of course the top ranking would be relevance, so your question is quite interesting in that sense.
What is relevance and how do you calculate it? There are several algorithms and I bet Google have their own, but ones I'm aware of are Pearson Correlation and Euclidean Distance.
A good book I'd suggest on this topic (not necessarily search engines) is Programming Collective Intelligence by Toby Segaran (O'Reilly). A few samples from the book show how to fetch data from third-party websites via APIs or screen-scraping, and finding similar entries, which is quite nice.
Anyways, back to Google. Other relevance techniques are of course full-text searching and you may want to get a good book on MySQL or Sphinx for that matter. Suggested by @Chaoley was TSEP which is also quite interesting.
But really, I know people from a Russian search engine called Yandex here, and everything they do is under NDA, so I guess you can get close, but you cannot get perfect, unless you work at Google ;)
Cheers.