Recrawl URL with Nutch just for updated sites

生来不讨喜 2020-12-31 14:42

I crawled one URL with Nutch 2.1, and now I want to re-crawl its pages only after they have been updated. How can I do this? How can I tell that a page has been updated?

3 answers
  •  無奈伤痛
    2020-12-31 15:09

    You have to schedule a job that triggers the crawl periodically.
    However, Nutch's AdaptiveFetchSchedule should let you crawl and index pages and detect whether a page is new or updated, so you don't have to check manually.

    Article describes the same in detail.
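
To make the idea behind an adaptive schedule concrete, here is a minimal Python sketch. It is not Nutch's code (Nutch is written in Java, and the schedule is selected there through the `db.fetch.schedule.class` property in `nutch-site.xml`). The class name `AdaptiveSchedule`, the helper `page_changed`, and the numeric rates are assumptions chosen for illustration. The sketch shows two things: detecting a change by comparing content hashes, and shortening the re-fetch interval for pages that changed while lengthening it for pages that did not.

```python
import hashlib
from dataclasses import dataclass


def page_changed(old_content: bytes, new_content: bytes) -> bool:
    """Hypothetical helper: treat a page as updated if its content hash differs."""
    return hashlib.md5(old_content).digest() != hashlib.md5(new_content).digest()


@dataclass
class AdaptiveSchedule:
    """Illustrative adaptive re-fetch schedule (all values are assumptions)."""
    inc_rate: float = 0.4                  # grow interval when page is unchanged
    dec_rate: float = 0.2                  # shrink interval when page changed
    min_interval: float = 60.0             # seconds
    max_interval: float = 365 * 24 * 3600.0  # seconds

    def next_interval(self, interval: float, modified: bool) -> float:
        # Changed pages are revisited sooner, unchanged pages later.
        if modified:
            interval *= 1.0 - self.dec_rate
        else:
            interval *= 1.0 + self.inc_rate
        # Keep the result inside the configured bounds.
        return max(self.min_interval, min(self.max_interval, interval))


if __name__ == "__main__":
    schedule = AdaptiveSchedule()
    interval = 24 * 3600.0  # start with a one-day interval
    changed = page_changed(b"<html>old</html>", b"<html>new</html>")
    interval = schedule.next_interval(interval, changed)
    print(f"changed={changed}, next fetch in {interval:.0f} s")
```

With a scheme like this, the periodic job still runs on a fixed timer, but each run only re-fetches pages whose interval has elapsed, so frequently updated pages end up being crawled more often than static ones.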
