Question
I would like to know how some search websites get their content. I used 'torrentz.eu' as the example in the title because it aggregates content from several sources. I would like to know what is behind such a system: do they 'simply' parse all the websites they support and then show the content? Do they use some web service? Or both?
Answer 1:
You are looking for the Crawling aspect of Information Retrieval.
Basically, crawling is: given an initial set S of websites, try to expand it by exploring the links, i.e. find the transitive closure (1).
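Below is a minimal sketch of that idea: start from a seed set of URLs and expand it breadth-first by following links, stopping at a fixed page budget (a stand-in for "close enough to the transitive closure"). The seed URL and budget are placeholders, and it assumes the third-party requests and beautifulsoup4 packages are installed; a real crawler would also respect robots.txt, rate-limit itself, and persist what it fetches.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup


def crawl(seeds, max_pages=100):
    """Breadth-first crawl starting from the seed set; returns the visited URLs."""
    queue = deque(seeds)
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # skip unreachable or failing pages
        visited.add(url)
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            # Resolve relative links and drop URL fragments before queueing.
            link, _ = urldefrag(urljoin(url, a["href"]))
            if link.startswith("http") and link not in visited:
                queue.append(link)
    return visited


if __name__ == "__main__":
    pages = crawl(["https://example.com/"])  # placeholder seed set
    print(f"Visited {len(pages)} pages")
```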
Some websites also use focused crawlers if they intend to index only a subset of the web in the first place.
P.S. Some websites do neither, and instead use a service such as the Google Custom Search API, Yahoo BOSS, or the Bing Developer APIs (for a fee, of course), relying on those providers' indexes rather than building one of their own.
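As a rough illustration of that route, here is a sketch that queries Google's Custom Search JSON API instead of crawling anything. The API key, search-engine ID, and query are placeholders you would replace with your own credentials; the other providers mentioned above work along similar lines but with their own endpoints.

```python
import requests

API_KEY = "YOUR_API_KEY"             # placeholder: key from the Google Cloud console
ENGINE_ID = "YOUR_SEARCH_ENGINE_ID"  # placeholder: the 'cx' id of a custom search engine


def search(query):
    """Return the result items for a query from the hosted index."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": ENGINE_ID, "q": query},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])


if __name__ == "__main__":
    for item in search("example query"):  # placeholder query
        print(item.get("title"), "->", item.get("link"))
```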
P.P.S. This describes a theoretical approach to how one could do it; I have no idea how the website mentioned in the question actually works.
(1) Due to time constraints, the full transitive closure is usually not computed, only something close enough to it.
Source: https://stackoverflow.com/questions/12405967/how-do-websites-like-torrentz-eu-collect-their-content