Python Scrapy 301 redirects

不打扰是莪最后的温柔 submitted on 2021-02-17 20:53:23


I have a small problem printing the redirected URLs (the new URLs after a 301 redirect) when scraping a given website. I only want to print them, not scrape them. My current code is:

import scrapy
import os
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'rust'
    allowed_domains = ['']
    start_urls = ['']

    rules = (
        # Extract every link, pass each response to parse_item,
        # and keep following links from those pages.
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # if response.status == 301:
        print(response.url)

However, this does not print the redirected URLs. Any help would be appreciated.

Thank you.


To handle responses whose status is outside the 200-299 range, you need to do one of the following:


Set HTTPERROR_ALLOWED_CODES = [301,302,...] in your project's settings.py file. To allow every status code, set HTTPERROR_ALLOW_ALL = True instead.


Add a handle_httpstatus_list attribute to your spider. In your case, something like:

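class MySpider(scrapy.Spider):
    handle_httpstatus_list = [301]
    # handle_httpstatus_all is documented as a Request.meta key rather than a
    # spider attribute; to allow every status spider-wide, use the
    # HTTPERROR_ALLOW_ALL setting instead.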


Set the meta key handle_httpstatus_list = [301, 302,...], or handle_httpstatus_all = True to allow all statuses, on individual requests:

scrapy.Request('', meta={'handle_httpstatus_list': [301]})

To learn more, see the HttpErrorMiddleware documentation.
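
For the settings-based option, a settings.py fragment could look roughly like this. One caveat worth hedging: HTTPERROR_ALLOWED_CODES is read by HttpErrorMiddleware, while redirects are followed earlier by the redirect middleware, so this sketch assumes redirects should be reported rather than followed and turns redirect following off project-wide.

```python
# settings.py (project-wide sketch)

# Let 301/302 responses pass HttpErrorMiddleware and reach spider callbacks.
HTTPERROR_ALLOWED_CODES = [301, 302]

# Assumption: redirects should be printed, not followed. With redirect
# following enabled, 3xx responses are followed before the spider sees them.
REDIRECT_ENABLED = False
```

Unlike the spider attribute or the per-request meta key, this applies to every spider in the project.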

