Scrapy - how to identify already scraped URLs

南笙 2020-12-05 08:28

I'm using Scrapy to crawl a news website on a daily basis. How do I restrict Scrapy from scraping already scraped URLs? Also, is there any clear documentation or examples on this?

5 Answers
  •  悲哀的现实
    2020-12-05 08:42

    You can actually do this quite easily with the Scrapy snippet located here: http://snipplr.com/view/67018/middleware-to-avoid-revisiting-already-visited-items/

    To use it, copy the code from the link into a file in your Scrapy project, then enable it by adding a line to your settings.py:

    SPIDER_MIDDLEWARES = { 'project.middlewares.ignore.IgnoreVisitedItems': 560 }
    

    The specifics of why you pick that particular number (the middleware order) can be read up here: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html
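    In case the snippet link above is unavailable, here is a rough sketch of what such a spider middleware could look like. Only the class name is taken from the settings entry above; persisting the seen fingerprints via spider.state (which requires running the crawl with a JOBDIR) and the simplified filtering logic are my assumptions, not a copy of the original snippet:

    # Rough sketch only -- not the original snippet. The class name matches the
    # SPIDER_MIDDLEWARES entry above; the rest is an assumed implementation.
    import scrapy
    from scrapy.utils.request import request_fingerprint


    class IgnoreVisitedItems:
        """Skip requests for pages that already produced an item in a past run."""

        def process_spider_output(self, response, result, spider):
            # spider.state is persisted between runs when the crawl is started
            # with a JOBDIR (e.g. scrapy crawl news -s JOBDIR=crawls/news);
            # without one, fall back to a plain in-memory dict.
            if not hasattr(spider, 'state'):
                spider.state = {}
            visited = spider.state.setdefault('visited_ids', set())

            for entry in result:
                if isinstance(entry, scrapy.Request):
                    # Drop requests for pages we have already scraped.
                    if request_fingerprint(entry) not in visited:
                        yield entry
                elif isinstance(entry, scrapy.Item):
                    # Remember the page this item came from so it is skipped on
                    # the next daily run, and record that on the item itself.
                    visit_id = request_fingerprint(response.request)
                    visited.add(visit_id)
                    entry['visit_id'] = visit_id
                    entry['visit_status'] = 'new'
                    yield entry
                else:
                    yield entry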

    Finally, you'll need to modify your items.py so that each item class has the following fields:

    visit_id = Field()
    visit_status = Field()
    
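    For example, a minimal items.py could look like the sketch below; the NewsArticle name and its other fields are just placeholders for whatever your item already defines:

    import scrapy


    class NewsArticle(scrapy.Item):
        title = scrapy.Field()
        url = scrapy.Field()
        # Extra fields used by the visited-items middleware:
        visit_id = scrapy.Field()
        visit_status = scrapy.Field()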

    And I think that's it. The next time you run your spider, it should automatically start skipping pages it has already scraped.

    Good luck!
