Re-crawling websites fast

亡梦爱人 submitted on 2019-12-03 20:46:58

I recommend using curl to fetch only the response headers (a HEAD request) and checking whether the Last-Modified header has changed since your last crawl.

Example:

 curl --head www.bankier.pl
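If you want to do the same check from a script, here is a minimal Python sketch using only the standard library. The function names `fetch_last_modified` and `has_changed` are made up for illustration, and it assumes the server actually sends a Last-Modified header; when the header is missing, the sketch simply treats the page as changed.

```python
import urllib.request
from email.utils import parsedate_to_datetime


def fetch_last_modified(url, timeout=10):
    """Send a HEAD request and return the Last-Modified header value, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Last-Modified")


def has_changed(previous, current):
    """Return True when the page should be re-crawled.

    previous and current are Last-Modified header strings (or None).
    A missing header on either side counts as "changed", because the
    server gives us nothing to compare.
    """
    if previous is None or current is None:
        return True
    return parsedate_to_datetime(current) > parsedate_to_datetime(previous)
```

You would store the value returned by `fetch_last_modified("http://www.bankier.pl/")` after each crawl and pass it as `previous` on the next run. Note that urllib needs the scheme in the URL, unlike the curl command above.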

For Nutch, I have written a blog post on how to re-crawl. Basically, you should set a low value for the db.fetch.interval.default setting (the interval between re-fetches of a page, in seconds). On the next fetch of a URL, Nutch will use the last fetch time as the value of the If-Modified-Since HTTP header, so the server can tell it when the page has not changed.
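To show what such a conditional request looks like on the wire, here is a rough Python sketch. It is not Nutch's code, only an illustration of the If-Modified-Since mechanism; `if_modified_since_value` and `fetch_if_modified` are hypothetical names, and the last fetch time is assumed to be stored as a Unix timestamp.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def if_modified_since_value(last_fetch_epoch):
    """Format a Unix timestamp as an HTTP date string in GMT."""
    return formatdate(last_fetch_epoch, usegmt=True)


def fetch_if_modified(url, last_fetch_epoch, timeout=10):
    """Return the page body, or None if the server answers 304 Not Modified."""
    request = urllib.request.Request(
        url,
        headers={"If-Modified-Since": if_modified_since_value(last_fetch_epoch)},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        # urllib reports 304 as an HTTPError; it means "unchanged, skip it".
        if error.code == 304:
            return None
        raise
```

A `None` result means the page can be skipped for this crawl cycle. Servers that ignore the header will just return the full page with status 200.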
