How to handle a 302 redirect in Scrapy

时光说笑 2020-11-29 09:32

I am receiving a 302 response from a server while scraping a website:

2014-04-01 21:31:51+0200 [ahrefs-h] DEBUG: Redirecting (302) to 

        
6 answers
  •  囚心锁ツ
    2020-11-29 10:07

    An inexplicable 302 response, such as a redirect from a page that loads fine in a web browser to the home page or some fixed page, usually indicates a server-side measure against undesired activity.

    You must either reduce your crawl rate or use a smart proxy (e.g. Crawlera) or a proxy-rotation service and retry your requests when you get such a response.
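Reducing the crawl rate is mostly a matter of project settings. The fragment below is a minimal sketch for a project's settings.py; the setting names are standard Scrapy settings, but the values are assumptions chosen for illustration and would need tuning for the target site.

```python
# settings.py (illustrative values, not recommendations from the answer)

# Wait about 2 seconds between requests to the same site.
DOWNLOAD_DELAY = 2

# Send at most one request at a time to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let the AutoThrottle extension adjust the delay based on server latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```

If the 302 responses persist at a low rate, the block is probably not rate-based, and a proxy is the more likely remedy.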

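    To retry such a response, add 'handle_httpstatus_list': [302] to the meta of the source request, so that the 302 response reaches your callback instead of being followed automatically, and check whether response.status == 302 in the callback. If it is, retry your request by yielding response.request.replace(dont_filter=True); dont_filter=True keeps the duplicate filter from dropping the repeated URL.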

    When retrying, you should also make your code limit the maximum number of retries of any given URL. You could keep a dictionary to track retries:

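    from scrapy import Request, Spider


    class MySpider(Spider):
        name = 'my_spider'

        # Maximum number of retries allowed for any single URL.
        max_retries = 2

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # Maps each URL to the number of times it has been retried.
            self.retries = {}

        def start_requests(self):
            yield Request(
                'https://example.com',
                callback=self.parse,
                meta={
                    # Pass 302 responses to the callback instead of following them.
                    'handle_httpstatus_list': [302],
                },
            )

        def parse(self, response):
            if response.status == 302:
                retries = self.retries.setdefault(response.url, 0)
                if retries < self.max_retries:
                    self.retries[response.url] += 1
                    # dont_filter=True bypasses the duplicate request filter.
                    yield response.request.replace(dont_filter=True)
                else:
                    self.logger.error('%s still returns 302 responses after %s retries',
                                      response.url, retries)
                return
            # Regular parsing of non-302 responses goes here.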
    

    Depending on the scenario, you might want to move this code to a downloader middleware.
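A downloader middleware version might look like the sketch below. It is an illustration under assumptions, not part of Scrapy: the class name Retry302Middleware and the meta key retry_302_times are hypothetical, and the retry count is kept in the request's meta rather than in a dictionary on the spider. It relies only on the process_response hook, which may return either the response or a new request to be rescheduled.

```python
import logging

logger = logging.getLogger(__name__)


class Retry302Middleware:
    """Hypothetical downloader middleware that retries requests answered with 302."""

    max_retries = 2                 # same limit as in the spider above
    retry_key = 'retry_302_times'   # hypothetical meta key, not a Scrapy built-in

    def process_response(self, request, response, spider):
        # Anything other than a 302 passes through untouched.
        if response.status != 302:
            return response

        retries = request.meta.get(self.retry_key, 0)
        if retries < self.max_retries:
            # Returning a request from process_response reschedules it;
            # dont_filter=True bypasses the duplicate request filter.
            retry_request = request.replace(dont_filter=True)
            retry_request.meta[self.retry_key] = retries + 1
            return retry_request

        logger.error('%s still returns 302 responses after %s retries',
                     request.url, retries)
        # Give up: hand the 302 to the rest of the middleware chain.
        return response
```

To enable it, the class would be registered in DOWNLOADER_MIDDLEWARES under its import path (for example 'myproject.middlewares.Retry302Middleware', a hypothetical path). The order value matters: the middleware has to see the 302 before the built-in RedirectMiddleware follows it, which should mean a value above that middleware's default of 600, such as 650; check this against the default order for your Scrapy version. Once the retries are exhausted, the sketch returns the 302 unchanged, so the built-in redirect handling then applies as usual.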
