I'm working with the CrawlSpider class to crawl a website and I would like to modify the headers that are sent in each request. Specifically, I would like to add the refer
You can pass the Referer header manually to each request using the headers argument:
yield Request(url=..., callback=..., headers={'Referer': ...})
RefererMiddleware does the same automatically, taking the Referer URL from the response that generated the request.
I hate to answer my own question, but I found out how to do it. You have to enable the SpiderMiddleware that populates the Referer header of each request from the response that generated it. See the documentation for scrapy.contrib.spidermiddleware.referer.RefererMiddleware.
In short, you need to add this middleware to your project's settings file.
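SPIDER_MIDDLEWARES = {
    # The value is the middleware's order; 700 is its default position.
    # Scrapy 1.0+ exposes this class as scrapy.spidermiddlewares.referer.RefererMiddleware.
    'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700,
}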
Then in your response parsing method you can use response.request.headers.get('Referer', None) to get the referer. Note that the header name is spelled 'Referer', with a single "r" in the middle.
If you don't understand these middlewares right away, read the documentation again, take a break, and then read it once more. I found them to be very confusing.