Scrapy: how to set the Referer URL

旧时难觅i 2020-12-16 17:54

I need to set the Referer URL before scraping a site. The site uses referrer-based authentication, so it does not let me log in unless the Referer is valid.

4 answers
  • 2020-12-16 18:31

    Do exactly as @warwaruk indicated. Below is my example, elaborated for a crawl spider:

    from scrapy.spiders import CrawlSpider
    from scrapy import Request
    
    class MySpider(CrawlSpider):
      name = "myspider"
      allowed_domains = ["example.com"]
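      # Commas are required: without them Python concatenates the
      # adjacent string literals into a single URL.
      start_urls = [
          'http://example.com/foo',
          'http://example.com/bar',
          'http://example.com/baz',
          ]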
      rules = [(...)]
    
      def start_requests(self):
        # Attach the Referer header to every initial request.
        for url in self.start_urls:
          yield Request(url=url, headers={'Referer': 'http://www.example.com/'})
    
      def parse_me(self, response):
        (...)
    

    This should produce log lines like the following in your terminal:

    (...)
    [myspider] DEBUG: Crawled (200) <GET http://example.com/foo> (referer: http://www.example.com/)
    (...)
    [myspider] DEBUG: Crawled (200) <GET http://example.com/bar> (referer: http://www.example.com/)
    (...)
    [myspider] DEBUG: Crawled (200) <GET http://example.com/baz> (referer: http://www.example.com/)
    (...)
    

    This works the same way with BaseSpider (called scrapy.Spider in current Scrapy releases), because start_requests is a method of the base spider class, which CrawlSpider inherits from.

    The documentation describes other options you can set on a Request besides headers, such as cookies, the callback function, and the priority of the request.

  • 2020-12-16 18:34

    If you want the same Referer on your spider's requests, you can set DEFAULT_REQUEST_HEADERS in the settings.py file. These defaults apply only to requests that do not already carry the header:

    DEFAULT_REQUEST_HEADERS = {
        'Referer': 'http://www.google.com'
    }
    
  • 2020-12-16 18:34

    Override BaseSpider.start_requests and create your own Request objects there, passing your Referer header to each one.

  • 2020-12-16 18:47

    Just set the Referer URL in the Request headers:

    class scrapy.http.Request(url[, method='GET', body, headers, ...

    headers (dict) – the headers of this request. The dict values can be strings (for single valued headers) or lists (for multi-valued headers).

    Example:

    return Request(url=your_url, headers={'Referer':'http://your_referer_url'})
