Recrawl URL with Nutch just for updated sites

生来不讨喜 2020-12-31 14:42

I crawled one URL with Nutch 2.1, and now I want to re-crawl its pages only after they have been updated. How can I do this? How can I tell that a page has been updated?

3 Answers

忘掉有多难 (original poster)

2020-12-31 15:29

Simply put, you can't. You have to recrawl a page to check whether it has been updated. So, depending on your needs, prioritize the pages/domains and recrawl them at a fixed interval. For that you need a job scheduler such as Quartz.
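To illustrate the prioritization idea, here is a minimal sketch in plain Python (not Nutch or Quartz code). The priority classes, their intervals, and the `due_for_recrawl` helper are all assumptions made up for the example; a scheduler would call something like this periodically and hand the resulting URLs to the crawler.

```python
from datetime import datetime, timedelta

# Hypothetical recrawl intervals per priority class (assumed values).
RECRAWL_INTERVALS = {
    "high": timedelta(hours=1),
    "normal": timedelta(days=1),
    "low": timedelta(days=7),
}


def due_for_recrawl(pages, now):
    """Return the URLs whose last fetch is at least one interval old.

    `pages` maps URL -> (priority, last_fetch_time).
    """
    due = []
    for url, (priority, last_fetch) in pages.items():
        if now - last_fetch >= RECRAWL_INTERVALS[priority]:
            due.append(url)
    return sorted(due)


# Example data (hypothetical URLs and timestamps).
now = datetime(2020, 12, 31, 15, 0)
pages = {
    "http://example.com/news": ("high", datetime(2020, 12, 31, 13, 0)),
    "http://example.com/about": ("low", datetime(2020, 12, 30, 15, 0)),
}
print(due_for_recrawl(pages, now))
```

Here only the high-priority page is selected, because its last fetch is older than its one-hour interval, while the low-priority page is still inside its seven-day window.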

You also need to write a function that compares the old and new versions of a page. However, by default Nutch stores pages as index files; in other words, it generates new binary files to hold the HTML. I don't think those binary files can be compared directly, because Nutch combines all crawl results into a single file. If you want to save pages as raw HTML so that you can compare them, see my answer to this question.
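Assuming the pages are available as raw HTML strings, one possible shape for such a compare function is a content fingerprint. This is only a sketch in plain Python, independent of Nutch; the names `content_fingerprint` and `has_changed` are hypothetical, and collapsing whitespace before hashing is an assumed design choice so that purely cosmetic reformatting does not count as an update.

```python
import hashlib


def content_fingerprint(html: str) -> str:
    """Return a SHA-256 hex digest of the page with whitespace collapsed."""
    # Collapse runs of whitespace so trivial reformatting is ignored.
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(old_html: str, new_html: str) -> bool:
    """Report whether two versions of a page differ after normalization."""
    return content_fingerprint(old_html) != content_fingerprint(new_html)
```

Storing only the fingerprint of the previous crawl is enough to detect a change on the next one, so the old HTML itself does not have to be kept. Pages with dynamic parts (timestamps, ads) would look changed on every fetch unless those parts are stripped first.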
