Strategy for how to crawl/index frequently updated webpages?

别跟我提以往 2021-01-30 09:57

I'm trying to build a very small, niche search engine, using Nutch to crawl specific sites. Some of the sites are news/blog sites. If I crawl, say, techcrunch.com, and store an index of its pages, how should I handle re-crawling when the site is updated many times a day?

4 Answers
  •  半阙折子戏
    2021-01-30 10:05

    I'm not an expert in this topic by any stretch of the imagination but Sitemaps are one way to alleviate this problem.

    In its simplest terms, an XML Sitemap—usually called Sitemap, with a capital S—is a list of the pages on your website. Creating and submitting a Sitemap helps make sure that Google knows about all the pages on your site, including URLs that may not be discoverable by Google's normal crawling process. In addition, you can also use Sitemaps to provide Google with metadata about specific types of content on your site, including video, images, mobile, and News.

    Google uses this specifically to help them crawl news sites. Google's documentation has more detail on Sitemaps in general and on Google News Sitemaps in particular.

    Usually, you can find the Sitemap's URL listed in a website's robots.txt. For example, TechCrunch's Sitemap is just

    http://techcrunch.com/sitemap.xml

    which turns this problem into parsing XML on a regular basis. If you can't find it in the robots.txt, you can always contact the webmaster and see if they'll provide it to you.
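
    To make "parsing XML on a regular basis" concrete, here is a rough sketch using only the Python standard library (this is my own illustration, not something Nutch gives you, and the cutoff date is a placeholder). It reads the Sitemap URL out of robots.txt, then lists entries whose lastmod is newer than your last crawl:

    import urllib.request
    import xml.etree.ElementTree as ET
    from datetime import date

    SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    def sitemap_urls_from_robots(domain):
        """Return any Sitemap URLs advertised in the site's robots.txt."""
        robots = urllib.request.urlopen(f"https://{domain}/robots.txt").read().decode("utf-8", "replace")
        return [line.split(":", 1)[1].strip()
                for line in robots.splitlines()
                if line.lower().startswith("sitemap:")]

    def changed_since(sitemap_url, cutoff):
        """Yield (lastmod, url) pairs from a <urlset> sitemap modified after `cutoff`.

        Large sites often publish a <sitemapindex> of nested sitemaps instead;
        those need one more level of fetching, omitted here for brevity.
        """
        root = ET.fromstring(urllib.request.urlopen(sitemap_url).read())
        for url_el in root.iter(f"{SITEMAP_NS}url"):
            loc = url_el.findtext(f"{SITEMAP_NS}loc")
            lastmod = url_el.findtext(f"{SITEMAP_NS}lastmod")
            if loc and lastmod and date.fromisoformat(lastmod[:10]) > cutoff:
                yield lastmod, loc

    if __name__ == "__main__":
        for sitemap in sitemap_urls_from_robots("techcrunch.com"):
            for lastmod, loc in changed_since(sitemap, date(2012, 10, 1)):
                print(lastmod, loc)

    Re-running something like this on a schedule and feeding the printed URLs back into the crawl queue is the whole idea; the rest is bookkeeping.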

    UPDATE 1 (Oct 24, 2012, 10:45 AM)

    I spoke with one of my team members and he gave me some additional insight into how we handle this problem. I really want to reiterate that this isn't a simple issue and requires a lot of partial solutions.

    Another thing we do is monitor several "index pages" for changes on a given domain. Take the New York Times, for example. We create one index page for the top-level domain at:

    http://www.nytimes.com/

    If you take a look at the page, you'll notice additional sub-areas like World, US, Politics, Business, etc. We create additional index pages for all of them. Business has additional nested index pages like Global, DealBook, Markets, Economy, etc. It isn't uncommon for a single domain to have 20-plus index pages. If we notice any additional URLs added to an index page, we add them to the queue to crawl.
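
    In practice that boils down to re-fetching each index page, extracting its links, diffing them against the set you saw last time, and queuing anything new. Here is a rough standard-library sketch of that loop (the section URLs and the seen_urls.json file are placeholders of mine, not part of our actual pipeline):

    import json
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    INDEX_PAGES = [
        "https://www.nytimes.com/",
        "https://www.nytimes.com/section/world",
        "https://www.nytimes.com/section/business",
    ]
    SEEN_FILE = "seen_urls.json"

    class LinkExtractor(HTMLParser):
        """Collect absolute URLs from every <a href> on a page."""
        def __init__(self, base):
            super().__init__()
            self.base = base
            self.links = set()

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.add(urljoin(self.base, href))

    def load_seen():
        try:
            with open(SEEN_FILE) as f:
                return set(json.load(f))
        except FileNotFoundError:
            return set()

    def main():
        seen = load_seen()
        queue = []
        for page in INDEX_PAGES:
            html = urllib.request.urlopen(page).read().decode("utf-8", "replace")
            extractor = LinkExtractor(page)
            extractor.feed(html)
            new_urls = extractor.links - seen
            queue.extend(sorted(new_urls))   # hand these to the crawler
            seen |= extractor.links
        with open(SEEN_FILE, "w") as f:
            json.dump(sorted(seen), f)
        print(f"{len(queue)} new URLs queued")

    if __name__ == "__main__":
        main()

    You would replace the final print with whatever injects URLs into your Nutch crawl.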

    Obviously this is very frustrating because you may have to do this by hand for every website you want to crawl. You may want to consider paying for a solution. We use Superfeedr and are quite happy with it.

    Also, many websites still offer RSS, which is an effective way of discovering new pages. I would still recommend contacting a webmaster to see if they have any simple solution to help you out.
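
    Here is a rough sketch of polling such a feed, again standard library only (the feed URL is a guess on my part, and real feeds may be Atom rather than RSS 2.0, which uses different tag names):

    import urllib.request
    import xml.etree.ElementTree as ET
    from datetime import datetime, timedelta, timezone
    from email.utils import parsedate_to_datetime

    FEED_URL = "https://techcrunch.com/feed/"

    def fresh_items(feed_url, since):
        """Yield (published, link) for RSS 2.0 <item>s newer than `since`."""
        root = ET.fromstring(urllib.request.urlopen(feed_url).read())
        for item in root.iter("item"):
            link = item.findtext("link")
            pub = item.findtext("pubDate")   # RFC 822 date string in RSS 2.0
            if link and pub:
                published = parsedate_to_datetime(pub)
                if published.tzinfo is None:
                    # some feeds omit the timezone; assume UTC so comparison works
                    published = published.replace(tzinfo=timezone.utc)
                if published > since:
                    yield published, link

    if __name__ == "__main__":
        cutoff = datetime.now(timezone.utc) - timedelta(hours=6)
        for published, link in fresh_items(FEED_URL, cutoff):
            print(published.isoformat(), link)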
