How to handle a 429 Too Many Requests response in Scrapy?

深忆病人 2020-12-28 22:54

I'm trying to run a scraper whose output log ends as follows:

2017-04-25 20:22:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <42         


        
3 Answers
  •  臣服心动
    2020-12-28 23:10

    You can modify the retry middleware to pause when it gets a 429 error. Put the code below in middlewares.py:

    from scrapy.downloadermiddlewares.retry import RetryMiddleware
    from scrapy.utils.response import response_status_message
    
    import time
    
    class TooManyRequestsRetryMiddleware(RetryMiddleware):
    
        def __init__(self, crawler):
            super(TooManyRequestsRetryMiddleware, self).__init__(crawler.settings)
            self.crawler = crawler
    
        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)
    
        def process_response(self, request, response, spider):
            if request.meta.get('dont_retry', False):
                return response
            elif response.status == 429:
                self.crawler.engine.pause()   # stop the engine from scheduling new requests while we wait
                time.sleep(60)                # match this to the site's rate-limit window, e.g. 60 seconds for a per-minute limit
                self.crawler.engine.unpause()
                reason = response_status_message(response.status)
                return self._retry(request, reason, spider) or response
            elif response.status in self.retry_http_codes:
                reason = response_status_message(response.status)
                return self._retry(request, reason, spider) or response
            return response 
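
    A possible refinement (not from the original answer): instead of a fixed 60-second sleep, you could honour the Retry-After header when the server sends one. The class name below and the assumption that Retry-After arrives as a plain number of seconds are illustrative; the sketch falls back to a fixed delay otherwise.

    from scrapy.downloadermiddlewares.retry import RetryMiddleware
    from scrapy.utils.response import response_status_message
    
    import time
    
    class RetryAfterAwareRetryMiddleware(RetryMiddleware):
    
        DEFAULT_DELAY = 60  # fallback pause (seconds) when Retry-After is missing or unparseable
    
        def __init__(self, crawler):
            super(RetryAfterAwareRetryMiddleware, self).__init__(crawler.settings)
            self.crawler = crawler
    
        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)
    
        def process_response(self, request, response, spider):
            if request.meta.get('dont_retry', False):
                return response
            elif response.status == 429:
                try:
                    # Retry-After may also be an HTTP date; this sketch only handles the plain-seconds form.
                    delay = int(response.headers.get('Retry-After', b''))
                except (TypeError, ValueError):
                    delay = self.DEFAULT_DELAY
                self.crawler.engine.pause()
                time.sleep(delay)
                self.crawler.engine.unpause()
                reason = response_status_message(response.status)
                return self._retry(request, reason, spider) or response
            elif response.status in self.retry_http_codes:
                reason = response_status_message(response.status)
                return self._retry(request, reason, spider) or response
            return response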
    

    Add 429 to the retry codes in settings.py:

    RETRY_HTTP_CODES = [429]
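
    One caveat worth flagging (not in the original answer): assigning RETRY_HTTP_CODES like this replaces Scrapy's default list, so other transient errors (500, 502, 503, 504, 522, 524, 408) will no longer be retried. If you still want those, list them explicitly:

    RETRY_HTTP_CODES = [429, 500, 502, 503, 504, 522, 524, 408]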
    

    Then activate it in settings.py, and don't forget to deactivate the default retry middleware:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,  # disable the built-in retry middleware
        'flat.middlewares.TooManyRequestsRetryMiddleware': 543,  # 'flat' is this example's project name; use yours
    }
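
    Not part of the original answer, but a complementary option: slowing the crawl with Scrapy's built-in AutoThrottle extension and a download delay usually makes 429 responses much rarer in the first place. A minimal settings.py sketch, with values you would tune for the target site:

    # Illustrative throttling values; adjust per site
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 1
    AUTOTHROTTLE_MAX_DELAY = 60
    AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    CONCURRENT_REQUESTS_PER_DOMAIN = 1
    DOWNLOAD_DELAY = 1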
    
