How do Scrapy rules work with crawl spider

一个人的身影 2021-01-31 19:38

I'm having a hard time understanding Scrapy CrawlSpider rules. I have an example that doesn't work the way I would like it to, so it could be one of two things:

  1. I don't understand how rules work.
3 Answers
  •  無奈伤痛
    2021-01-31 19:55

    If you are from China, I have a Chinese blog post about this:

    别再滥用scrapy CrawlSpider中的follow=True (roughly: "Stop overusing follow=True in Scrapy's CrawlSpider")


    Let's check out how the rules work under the hood:

    def _requests_to_follow(self, response):
        seen = set()
        for n, rule in enumerate(self._rules):
            # every rule's link extractor runs on the response; links already
            # extracted by an earlier rule are skipped
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded)
                # remember which rule produced this request, so that
                # _response_downloaded can look it up from response.meta
                r.meta.update(rule=n, link_text=link.text)
                yield r

    As you can see, every rule's link extractor is run over the response in a for loop, and each extracted link is added to a shared seen set, so a link matched by more than one rule is only scheduled once, by the first rule that extracts it.
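    To make the effect of that shared seen set concrete, here is a small standalone sketch (not the Scrapy source; the HTML, URLs and allow patterns are made up) that mimics the loop above with two overlapping link extractors:

    from scrapy.http import HtmlResponse
    from scrapy.linkextractors import LinkExtractor

    html = b"""
    <html><body>
      <a href="/category/books">books</a>
      <a href="/category/books/item-1">item 1</a>
    </body></html>
    """
    response = HtmlResponse(url="http://example.com/", body=html, encoding="utf-8")

    # two overlapping "rules": the second pattern also matches /category/books
    extractors = [
        LinkExtractor(allow=r"/category/\w+$"),
        LinkExtractor(allow=r"/category/"),
    ]

    seen = set()
    for n, le in enumerate(extractors):
        links = [lnk for lnk in le.extract_links(response) if lnk not in seen]
        for link in links:
            seen.add(link)
            print(f"rule {n} schedules {link.url}")

    # expected output: rule 0 schedules /category/books, and rule 1 only
    # schedules /category/books/item-1, because the first link is already seen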

    Every response produced by those requests is then handled by self._response_downloaded:

    def _response_downloaded(self, response):
        # look up the rule whose link extractor produced this request
        rule = self._rules[response.meta['rule']]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)
    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        # first run the rule's callback (if any) and yield its items/requests
        if callback:
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item

        # if follow is enabled, the response goes back through the rules
        if follow and self._follow_links:
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item

    So the crawl keeps cycling: every followed response is passed to self._requests_to_follow(response) again, until no rule extracts any new links.
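    In a spider, that loop is driven entirely by the rules tuple. A minimal hedged example (the site layout, URLs and patterns here are hypothetical, not taken from the question):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class BooksSpider(CrawlSpider):
        name = "books"
        allowed_domains = ["example.com"]
        start_urls = ["http://example.com/category/books"]

        rules = (
            # no callback: these responses only feed _requests_to_follow
            # (follow defaults to True when callback is None)
            Rule(LinkExtractor(allow=r"/category/")),
            # callback with follow=False: parse_item runs on item pages,
            # but their links are not sent back through the rules
            Rule(LinkExtractor(allow=r"/item/"), callback="parse_item", follow=False),
        )

        def parse_item(self, response):
            yield {"url": response.url, "title": response.css("h1::text").get()}

    Setting follow=True on the second rule as well would push every item page back through the link extractors, which is the kind of unnecessary work the blog post linked above warns about.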

    In summary:

    - every response that reaches the rules is run through all of the link extractors, and the shared seen set keeps the same link from being scheduled by more than one rule;
    - a rule's callback (if any) handles the responses for the links that rule extracted;
    - follow=True sends the response back through _requests_to_follow, so the cycle repeats until no new links are found.
