I'm using Scrapy to crawl a news website on a daily basis. How do I restrict Scrapy from scraping already-scraped URLs? Also, is there any clear documentation or examples on
You can actually do this quite easily with the Scrapy snippet located here: http://snipplr.com/view/67018/middleware-to-avoid-revisiting-already-visited-items/
To use it, copy the code from the link into a file in your Scrapy project, then enable it by adding a line to your settings.py:
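In case the snippet link goes away, here is a minimal sketch of the idea behind it. The class name follows the snippet's convention, but the in-memory set is my simplification: the real snippet persists the visit ids with the spider's state so they survive between daily runs.

```python
# Sketch of a spider middleware that drops anything whose URL has
# already been seen. This is a simplified illustration, not the exact
# snippet code: here visited ids live only in memory for the lifetime
# of the crawl, whereas the real snippet stores them persistently.

import hashlib

class IgnoreVisitedItems:
    def __init__(self):
        self.visited = set()  # visit ids seen so far

    def _visit_id(self, url):
        # Stable fingerprint for a URL, so the same page always
        # maps to the same id.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def process_spider_output(self, response, result, spider):
        # Scrapy calls this with everything the spider yields for
        # one response (requests and items alike).
        for r in result:
            visit_id = self._visit_id(getattr(r, "url", response.url))
            if visit_id in self.visited:
                continue  # already scraped, drop it
            self.visited.add(visit_id)
            # In the real snippet, items get their visit_id /
            # visit_status fields set here so a pipeline can
            # persist them.
            yield r
```

The key point is that the filtering happens in `process_spider_output`, so duplicate requests are discarded before they are ever scheduled.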
SPIDER_MIDDLEWARES = { 'project.middlewares.ignore.IgnoreVisitedItems': 560 }
The specifics on WHY you pick the number that you do (the middleware order value, 560 above) can be read up here: http://doc.scrapy.org/en/latest/topics/spider-middleware.html (the spider middleware docs, since this is a spider middleware rather than a downloader middleware).
Finally, you'll need to modify your items.py so that each item class has the following two fields, which the middleware uses to track what has been visited (`NewsItem` is just a placeholder name here):

    from scrapy.item import Item, Field

    class NewsItem(Item):
        visit_id = Field()
        visit_status = Field()
And that's it. The next time you run your spider, it should automatically skip items it has already visited.
Good luck!