I have a hard time understanding scrapy crawl spider rules. I have an example that doesn't work as I would like it to, so it can be two things:
If you are from China, I have a Chinese blog post about this:
别再滥用scrapy CrawlSpider中的follow=True ("Stop overusing follow=True in scrapy's CrawlSpider")
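To make the walkthrough below concrete, here is a minimal CrawlSpider sketch (the domain, start URL, and allow patterns are made up for illustration): one rule only follows pagination links, the other only parses item pages.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.com"]            # hypothetical domain
    start_urls = ["https://example.com/list"]    # hypothetical start page

    rules = (
        # follow pagination links, but don't parse them (no callback)
        Rule(LinkExtractor(allow=r"/list\?page=\d+"), follow=True),
        # parse item pages with parse_item, don't extract links from them
        Rule(LinkExtractor(allow=r"/item/\d+"), callback="parse_item", follow=False),
    )

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}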
Let's check out how the rules work under the hood:
def _requests_to_follow(self, response):
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        for link in links:
            seen.add(link)
            # the rule index is stored in meta so that _response_downloaded
            # can look up which rule produced this request
            r = Request(url=link.url, callback=self._response_downloaded,
                        meta={'rule': n})
            yield r
As you can see, when we follow a link, the links in the response are extracted by every rule in a for loop, and each extracted link is added to a seen set so the same link is not requested twice.
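Here is a tiny standalone check (not part of Scrapy's source; the URL is made up) of why that dedup works across rules: scrapy.link.Link compares by url, text, fragment and nofollow, so a link already added to seen by an earlier rule is filtered out for every later rule.

from scrapy.link import Link

seen = set()
link_from_rule_0 = Link(url="https://example.com/item/42?page=2")
link_from_rule_1 = Link(url="https://example.com/item/42?page=2")

seen.add(link_from_rule_0)
print(link_from_rule_1 in seen)  # True -> the later rule skips this link

In other words, when a URL matches the link extractors of several rules, the first rule in the rules tuple claims it and decides the callback and follow behaviour for that link.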
All of those responses will then be handled by self._response_downloaded:
def _response_downloaded(self, response):
    rule = self._rules[response.meta['rule']]
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

def _parse_response(self, response, callback, cb_kwargs, follow=True):
    if callback:
        cb_res = callback(response, **cb_kwargs) or ()
        cb_res = self.process_results(response, cb_res)
        for requests_or_item in iterate_spider_output(cb_res):
            yield requests_or_item
    # follow will go back to the rules again
    if follow and self._follow_links:
        for request_or_item in self._requests_to_follow(response):
            yield request_or_item
If follow is True, it goes back to self._requests_to_follow(response) again and again, so every followed response is run through the rules and can produce yet more requests.
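If you do not want that loop to run until the site is exhausted, here is a sketch of two common ways to bound it (the spider, start URL, and allow pattern are made up; DEPTH_LIMIT and custom_settings are standard Scrapy features): cap the crawl depth, or switch off follow on rules that only need to parse pages.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BoundedSpider(CrawlSpider):
    name = "bounded"
    start_urls = ["https://example.com/"]  # hypothetical start page

    # 1) cap how deep the follow loop may recurse from the start URLs
    custom_settings = {"DEPTH_LIMIT": 3}

    rules = (
        # 2) or disable following for rules that only need to parse pages
        Rule(LinkExtractor(allow=r"/item/\d+"), callback="parse_item", follow=False),
    )

    def parse_item(self, response):
        yield {"url": response.url}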
In summary: