How to force Scrapy to crawl duplicate URLs?

Submitted by 老子叫甜甜 on 2019-11-27 13:53:37

Question


I am learning Scrapy, a web crawling framework. By default it does not crawl duplicate URLs, i.e. URLs it has already crawled.

How can I make Scrapy crawl duplicate URLs, or URLs that have already been crawled? I tried to find an answer on the internet but could not find relevant help.

I found DUPEFILTER_CLASS = RFPDupeFilter and SgmlLinkExtractor from Scrapy - Spider crawls duplicate urls, but that question is the opposite of what I am looking for.


Answer 1:


You're probably looking for the dont_filter=True argument on Request(). See http://doc.scrapy.org/en/latest/topics/request-response.html#request-objects




Answer 2:


A more elegant solution is to disable the duplicate filter altogether:

# settings.py
DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'

This way you don't have to clutter all your Request-creation code with dont_filter=True. A nice side effect: this only disables duplicate filtering, not any other filters such as offsite filtering.

If you want to use this setting selectively for only one or some of multiple spiders in your project, you can set it via custom_settings in the spider implementation:

class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
    }


Source: https://stackoverflow.com/questions/23131283/how-to-force-scrapy-to-crawl-duplicate-url
