Re-crawl a URL with Nutch only for updated pages

生来不讨喜 2020-12-31 14:42

I crawled one URL with Nutch 2.1, and now I want to re-crawl its pages after they have been updated. How can I do this? How can I tell that a page has been updated?

3 answers
  •  半阙折子戏
    2020-12-31 15:28

    What about this post: http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/

    This is also discussed in "How to recrawle nutch".

    I am wondering whether the above-mentioned solution will actually work; I am trying it right now. I crawl news sites, which update their front pages quite frequently, so I need to re-crawl the index/front page often and fetch the newly discovered links.
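
On the "how can I know that a page is updated?" part: one common approach is to keep a hash (signature) of each page's content and compare it on the next fetch. The sketch below only illustrates that idea in plain Python. It is not Nutch code, and the names `content_signature`, `has_changed`, and `known_signatures` are hypothetical, introduced here for illustration.

```python
import hashlib


def content_signature(body: bytes) -> str:
    """Return an MD5 hex digest of the raw page content."""
    return hashlib.md5(body).hexdigest()


def has_changed(url: str, body: bytes, known_signatures: dict) -> bool:
    """Return True if the page is new or its content differs from the last fetch.

    known_signatures is a hypothetical in-memory store mapping URL -> signature;
    a real crawler would persist this between runs.
    """
    new_sig = content_signature(body)
    old_sig = known_signatures.get(url)
    known_signatures[url] = new_sig  # remember the latest signature
    return old_sig != new_sig
```

A quick usage example, assuming the page body has already been fetched: calling `has_changed(url, body, store)` twice with the same body reports a change only the first time. Note that a byte-level hash flags any difference, including rotating ads or timestamps, so news front pages may look "updated" on every fetch.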
