Scrapy - how to identify already scraped URLs

南笙 2020-12-05 08:28

I'm using Scrapy to crawl a news website on a daily basis. How do I restrict Scrapy from scraping already scraped URLs? Also, is there any clear documentation or examples on this?

5 Answers
  •  悲哀的现实
    2020-12-05 08:42

    You can actually do this quite easily with the Scrapy snippet located here: http://snipplr.com/view/67018/middleware-to-avoid-revisiting-already-visited-items/

    To use it, copy the code from the link into a file in your Scrapy project, then enable it by adding a line to your settings.py:

    SPIDER_MIDDLEWARES = { 'project.middlewares.ignore.IgnoreVisitedItems': 560 }
    

    The specifics of why you pick that particular number (the middleware order) can be read up here: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html
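    In case the snippet link above is unavailable, here is a rough sketch of what such a spider middleware could look like. Only the class name is taken from the settings entry above; persisting the seen fingerprints via spider.state (which requires running the crawl with a JOBDIR) and the simplified filtering logic are my assumptions, not a copy of the original snippet:

    # Rough sketch only -- not the original snippet. The class name matches the
    # SPIDER_MIDDLEWARES entry above; the rest is an assumed implementation.
    import scrapy
    from scrapy.utils.request import request_fingerprint


    class IgnoreVisitedItems:
        """Skip requests for pages that already produced an item in a past run."""

        def process_spider_output(self, response, result, spider):
            # spider.state is persisted between runs when the crawl is started
            # with a JOBDIR (e.g. scrapy crawl news -s JOBDIR=crawls/news);
            # without one, fall back to a plain in-memory dict.
            if not hasattr(spider, 'state'):
                spider.state = {}
            visited = spider.state.setdefault('visited_ids', set())

            for entry in result:
                if isinstance(entry, scrapy.Request):
                    # Drop requests for pages we have already scraped.
                    if request_fingerprint(entry) not in visited:
                        yield entry
                elif isinstance(entry, scrapy.Item):
                    # Remember the page this item came from so it is skipped on
                    # the next daily run, and record that on the item itself.
                    visit_id = request_fingerprint(response.request)
                    visited.add(visit_id)
                    entry['visit_id'] = visit_id
                    entry['visit_status'] = 'new'
                    yield entry
                else:
                    yield entry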

    Finally, you'll need to modify your items.py so that each item class has the following fields:

    visit_id = Field()
    visit_status = Field()
    
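    For example, a minimal items.py could look like the sketch below; the NewsArticle name and its other fields are just placeholders for whatever your item already defines:

    import scrapy


    class NewsArticle(scrapy.Item):
        title = scrapy.Field()
        url = scrapy.Field()
        # Extra fields used by the visited-items middleware:
        visit_id = scrapy.Field()
        visit_status = scrapy.Field()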

    And I think that's it. The next time you run your spider, it should automatically start skipping pages it has already scraped.

    Good luck!
