Question
I would like to know how some search websites get their content. I used 'torrentz.eu' as the example in the title because it aggregates content from several sources. I would like to know what is behind such a system: do they 'simply' parse all the websites they support and then show the content? Do they use some web service? Or both?
Answer 1:
You are looking for the Crawling aspect of Information Retrieval.
Basically, crawling is: given an initial set S of websites, try to expand it by exploring the links, i.e. find the transitive closure (1).
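Below is a minimal sketch of that idea: start from a seed set of URLs and expand it breadth-first by following links, stopping at a fixed page budget (a stand-in for "close enough to the transitive closure"). The seed URL and budget are placeholders, and it assumes the third-party requests and beautifulsoup4 packages are installed; a real crawler would also respect robots.txt, rate-limit itself, and persist what it fetches.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup


def crawl(seeds, max_pages=100):
    """Breadth-first crawl starting from the seed set; returns the visited URLs."""
    queue = deque(seeds)
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # skip unreachable or failing pages
        visited.add(url)
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            # Resolve relative links and drop URL fragments before queueing.
            link, _ = urldefrag(urljoin(url, a["href"]))
            if link.startswith("http") and link not in visited:
                queue.append(link)
    return visited


if __name__ == "__main__":
    pages = crawl(["https://example.com/"])  # placeholder seed set
    print(f"Visited {len(pages)} pages")
```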
Some websites also use focused crawlers if they intend to index only a subset of the web in the first place.
P.S. Some websites do neither, and instead use a service such as the Google Custom Search API, Yahoo BOSS, or the Bing Developer APIs (for a fee, of course), relying on those providers' indexes rather than building one of their own.
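As a rough illustration of that route, here is a sketch that queries Google's Custom Search JSON API instead of crawling anything. The API key, search-engine ID, and query are placeholders you would replace with your own credentials; the other providers mentioned above work along similar lines but with their own endpoints.

```python
import requests

API_KEY = "YOUR_API_KEY"             # placeholder: key from the Google Cloud console
ENGINE_ID = "YOUR_SEARCH_ENGINE_ID"  # placeholder: the 'cx' id of a custom search engine


def search(query):
    """Return the result items for a query from the hosted index."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": ENGINE_ID, "q": query},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])


if __name__ == "__main__":
    for item in search("example query"):  # placeholder query
        print(item.get("title"), "->", item.get("link"))
```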
P.P.S. This describes a theoretical approach to how one could do it; I have no idea how the website mentioned in the question actually works.
(1) Due to time constraints, the full transitive closure is usually not computed, only something close enough to it.
Source: https://stackoverflow.com/questions/12405967/how-do-websites-like-torrentz-eu-collect-their-content