How do websites like torrentz.eu collect their content?

别等时光非礼了梦想 · Submitted on 2019-12-10 10:02:59

Question


I would like to know how some search websites get their content. I used 'torrentz.eu' in the title as an example because it aggregates content from several sources. What is behind such a system: do they 'simply' parse all the websites they support and then display the content? Do they use some web service? Or both?


Answer 1:


You are looking for the crawling aspect of information retrieval.

Basically, crawling is: given an initial set S of websites, expand it by exploring their links (i.e., find the transitive closure(1)).
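As a minimal sketch of that idea, here is a breadth-first crawler in Python. The seed list, the `max_pages` budget, and the use of `requests` plus a regex for link extraction are illustrative assumptions, not a description of how torrentz.eu actually works:

```python
import re
from collections import deque
from urllib.parse import urljoin

import requests

# Crude href extractor; a real crawler would use an HTML parser.
LINK_RE = re.compile(r'href="([^"#]+)"')

def crawl(seeds, max_pages=100):
    """Expand the seed set S by following links, up to a page budget."""
    seen = set(seeds)       # URLs ever discovered
    queue = deque(seeds)    # the crawl frontier
    pages = {}              # URL -> fetched HTML
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue  # skip unreachable pages
        pages[url] = resp.text  # hand the content to the indexer here
        # Expand the frontier with every new link found on the page.
        for href in LINK_RE.findall(resp.text):
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

# pages = crawl(["https://example.com"])
```

The `max_pages` budget is what footnote (1) alludes to: in practice the crawl is cut off long before the true transitive closure is reached.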

Some websites also use focused crawlers, if they intend to index only a subset of the web in the first place.

P.S. Some websites do neither, and instead use a service such as the Google Custom Search API, Yahoo BOSS, or the Bing Developer APIs (for a fee, of course), relying on those providers' indexes rather than building one of their own.
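For illustration, here is a hedged sketch of that approach using Google's Custom Search JSON API; the `api_key` and `engine_id` values are placeholders obtained from the Google API console, and the error handling is minimal:

```python
import requests

def search(query, api_key, engine_id):
    """Query an existing index instead of crawling one yourself."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": api_key, "cx": engine_id, "q": query},
        timeout=10,
    )
    resp.raise_for_status()
    # Each result item carries a "link" field; "items" is absent
    # when there are no results.
    return [item["link"] for item in resp.json().get("items", [])]
```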

P.P.S. This describes a theoretical approach to how one could do it; I have no idea how the mentioned website actually works.


(1) Due to time constraints, the full transitive closure is usually not computed, only something close enough to it.



Source: https://stackoverflow.com/questions/12405967/how-do-websites-like-torrentz-eu-collect-their-content
