Recrawl URL with Nutch just for updated sites

社会主义新天地 提交于 2019-12-18 15:52:43

问题


I crawled one URL with Nutch 2.1 and then I want to re-crawl pages after they got updated. How can I do this? How can I know that a page is updated?


回答1:


Simply you can't. You need to recrawl the page to control if it's updated. So according to your needs, prioritize the pages/domains and recrawl them within a time period. For that you need a job scheduler such as Quartz.

You need to write a function that compares the pages. However, Nutch originally saves the pages as index files. In other words Nutch generates new binary files to save HTMLs. I don't think it's possible to compare binary files, as Nutch combines all crawl results within a single file. If you want to save pages in raw HTML format to compare, see my answer to this question.




回答2:


You have to Schedule ta Job for Firing the Job
However, Nutch AdaptiveFetchSchedule should enable you to crawl and index pages and detect whether the page is new or updated and you don't have to do it manually.

Article describes the same in detail.




回答3:


what about http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/

This is discussed on : How to recrawle nutch

I am wondering if the above mentioned solution will indeed work. I am trying as we speak. I crawl news-sites and they update their frontpage quite frequently, so I need to re-crawl the index/frontpage often and fetch the newly discovered links.



来源:https://stackoverflow.com/questions/14261586/recrawl-url-with-nutch-just-for-updated-sites

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!