I have a hard time understanding scrapy crawl spider rules. I have an example that doesn't work as I would like it to, so it can be two things:
If you are from China, I have a Chinese blog post about this:
别再滥用scrapy CrawlSpider中的follow=True ("Stop overusing follow=True in scrapy's CrawlSpider")
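To make the walkthrough below concrete, here is a minimal CrawlSpider sketch (the domain, start URL, and allow patterns are made up for illustration): one rule only follows pagination links, the other only parses item pages.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.com"]            # hypothetical domain
    start_urls = ["https://example.com/list"]    # hypothetical start page

    rules = (
        # follow pagination links, but don't parse them (no callback)
        Rule(LinkExtractor(allow=r"/list\?page=\d+"), follow=True),
        # parse item pages with parse_item, don't extract links from them
        Rule(LinkExtractor(allow=r"/item/\d+"), callback="parse_item", follow=False),
    )

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}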
Let's check out how the rules work under the hood:
def _requests_to_follow(self, response):
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        for link in links:
            seen.add(link)
            # the rule index is stored in meta so that _response_downloaded
            # can look up which rule produced this request
            r = Request(url=link.url, callback=self._response_downloaded,
                        meta={'rule': n})
            yield r
As you can see, when we follow a link, the links in the response are extracted by every rule in a for loop, and each extracted link is added to a seen set so the same link is not requested twice.
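Here is a tiny standalone check (not part of Scrapy's source; the URL is made up) of why that dedup works across rules: scrapy.link.Link compares by url, text, fragment and nofollow, so a link already added to seen by an earlier rule is filtered out for every later rule.

from scrapy.link import Link

seen = set()
link_from_rule_0 = Link(url="https://example.com/item/42?page=2")
link_from_rule_1 = Link(url="https://example.com/item/42?page=2")

seen.add(link_from_rule_0)
print(link_from_rule_1 in seen)  # True -> the later rule skips this link

In other words, when a URL matches the link extractors of several rules, the first rule in the rules tuple claims it and decides the callback and follow behaviour for that link.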
All of those responses will then be handled by self._response_downloaded:
def _response_downloaded(self, response):
    rule = self._rules[response.meta['rule']]
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

def _parse_response(self, response, callback, cb_kwargs, follow=True):
    if callback:
        cb_res = callback(response, **cb_kwargs) or ()
        cb_res = self.process_results(response, cb_res)
        for requests_or_item in iterate_spider_output(cb_res):
            yield requests_or_item
    # follow will go back to the rules again
    if follow and self._follow_links:
        for request_or_item in self._requests_to_follow(response):
            yield request_or_item
If follow is True, it goes back to self._requests_to_follow(response) again and again, so every followed response is run through the rules and can produce yet more requests.
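If you do not want that loop to run until the site is exhausted, here is a sketch of two common ways to bound it (the spider, start URL, and allow pattern are made up; DEPTH_LIMIT and custom_settings are standard Scrapy features): cap the crawl depth, or switch off follow on rules that only need to parse pages.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BoundedSpider(CrawlSpider):
    name = "bounded"
    start_urls = ["https://example.com/"]  # hypothetical start page

    # 1) cap how deep the follow loop may recurse from the start URLs
    custom_settings = {"DEPTH_LIMIT": 3}

    rules = (
        # 2) or disable following for rules that only need to parse pages
        Rule(LinkExtractor(allow=r"/item/\d+"), callback="parse_item", follow=False),
    )

    def parse_item(self, response):
        yield {"url": response.url}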
In summary: