Strategy for how to crawl/index frequently updated webpages?

别跟我提以往  2021-01-30 09:57

I'm trying to build a very small, niche search engine, using Nutch to crawl specific sites. Some of the sites are news/blog sites. If I crawl, say, techcrunch.com, and store and index their front page, then within hours my index of that page will be out of date. Does a search engine like Google have a strategy for deciding how often to re-crawl pages that change frequently?

4 Answers
  •  忘了有多久
    2021-01-30 10:20

    Google's algorithms are mostly closed; they won't tell you how they do it.

    I built a crawler using the concept of a directed graph and based the re-crawl rate on pages' degree centrality. You could consider a website to be a directed graph with pages as nodes and hyperlinks as edges. A node with high centrality will probably be a page that is updated more often. At least, that is the assumption.

    This can be implemented by storing URLs and the links between them. If you crawl and keep every link you discover, each site's graph will grow over time. Calculating the (normalised) in- and out-degree of every node in a site's graph then gives you a measure of which pages are the most interesting to re-crawl more often, as in the sketch below.
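    Here is a minimal sketch of that idea in Python (the answer names no language; Nutch itself is Java, but Python keeps the sketch short). It assumes a simple in-memory graph; the names `SiteGraph` and `recrawl_interval_hours`, and the interval bounds, are illustrative assumptions, not from the original answer.

    ```python
    # Sketch: rank a site's pages by normalised degree centrality and
    # map centrality to a re-crawl interval. Illustrative only.
    from collections import defaultdict

    class SiteGraph:
        """Directed graph of one site: nodes are URLs, edges are hyperlinks."""

        def __init__(self):
            self.out_links = defaultdict(set)  # url -> urls it links to
            self.in_links = defaultdict(set)   # url -> urls linking to it

        def add_link(self, src: str, dst: str) -> None:
            self.out_links[src].add(dst)
            self.in_links[dst].add(src)
            # Ensure both endpoints exist as nodes even if otherwise isolated.
            self.out_links.setdefault(dst, set())
            self.in_links.setdefault(src, set())

        def degree_centrality(self, url: str) -> float:
            """Normalised degree centrality: (indegree + outdegree) / (2 * (n - 1))."""
            n = len(self.out_links)
            if n <= 1:
                return 0.0
            degree = len(self.in_links[url]) + len(self.out_links[url])
            return degree / (2 * (n - 1))

    def recrawl_interval_hours(centrality: float,
                               min_hours: float = 1.0,
                               max_hours: float = 168.0) -> float:
        """Map centrality in [0, 1] to a re-crawl interval: high-centrality
        hub pages are revisited often, leaf pages rarely. The linear mapping
        and the 1h/168h bounds are arbitrary choices for the sketch."""
        return max_hours - centrality * (max_hours - min_hours)

    # Example: a tiny blog-like site where the front page links to posts.
    graph = SiteGraph()
    graph.add_link("https://example.com/", "https://example.com/post-1")
    graph.add_link("https://example.com/", "https://example.com/post-2")
    graph.add_link("https://example.com/post-1", "https://example.com/")

    for url in graph.out_links:
        c = graph.degree_centrality(url)
        print(f"{url}: centrality={c:.2f}, recrawl every {recrawl_interval_hours(c):.0f}h")
    ```

    For the toy site above, the front page ends up with the highest centrality and therefore the shortest re-crawl interval, which matches the assumption that hub pages are the ones updated most often.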
