How to force Scrapy to crawl duplicate URLs?

遥遥无期 2020-12-14 17:00

I am learning Scrapy, a web crawling framework. By default it does not crawl duplicate URLs, i.e. URLs that Scrapy has already crawled.

How can I make Scrapy crawl duplicate URLs?

2 Answers
  • 2020-12-14 17:26

    You're probably looking for the dont_filter=True argument on Request(). See http://doc.scrapy.org/en/latest/topics/request-response.html#request-objects
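    For example, a minimal sketch of re-requesting a page with dont_filter=True (the spider name, start URL, and callback below are illustrative, not taken from the question):

    import scrapy

    class ForceDuplicateSpider(scrapy.Spider):
        # Hypothetical spider, for illustration only.
        name = 'force_duplicate'
        start_urls = ['http://example.com/']

        def parse(self, response):
            # dont_filter=True bypasses the scheduler's duplicate filter,
            # so this URL is crawled again even though it was already seen.
            yield scrapy.Request(response.url, callback=self.parse_again,
                                 dont_filter=True)

        def parse_again(self, response):
            self.logger.info('Re-crawled %s', response.url)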

  • 2020-12-14 17:40

    A more elegant solution is to disable the duplicate filter altogether:

    # settings.py
    DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'
    

    This way you don't have to clutter all your Request creation code with dont_filter=True. Another benefit: this only disables duplicate filtering, not any other filtering such as offsite filtering.

    If you want to apply this setting selectively to only some of the spiders in your project, you can set it via custom_settings in the spider implementation:

    class MySpider(scrapy.Spider):
        name = 'myspider'
    
        custom_settings = {
            'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
        }
    