How do Scrapy rules work with crawl spider

感情迁移 submitted on 2019-12-02 17:36:38

You are right: according to the source code, before returning each response to the callback function the crawler loops over the rules, starting from the first. Keep this in mind when you write your rules. For example, with the following rules:

rules = (
        Rule(SgmlLinkExtractor(allow=(r'/items',)), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=(r'/items/electronics',)), callback='parse_electronic_item', follow=True),
     )

The second rule will never be applied, because every link it could match is already extracted by the first rule with the parse_item callback. The matches for the second rule are then filtered out as duplicates by scrapy.dupefilter.RFPDupeFilter. You should use deny so the links are matched correctly:

rules = (
        Rule(SgmlLinkExtractor(allow=(r'/items',)), deny=(r'/items/electronics',), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=(r'/items/electronics',)), callback='parse_electronic_item', follow=True),
     )
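The overlap is easy to verify with plain re on a couple of hypothetical URLs: every link the second rule could match is also matched by the first, broader pattern, which is why deny is needed:

```python
import re

# Hypothetical item URLs, just to illustrate the overlap between the patterns.
urls = [
    "http://example.com/items/books/1",
    "http://example.com/items/electronics/2",
]

# The broad pattern matches *every* URL, including the electronics one...
assert all(re.search(r"/items", u) for u in urls)

# ...so without deny, the electronics URL is claimed by the first rule
# and never reaches the second one. With deny, the split is clean:
first_rule_with_deny = [u for u in urls
                        if re.search(r"/items", u)
                        and not re.search(r"/items/electronics", u)]
second_rule = [u for u in urls if re.search(r"/items/electronics", u)]

print(first_rule_with_deny)  # ['http://example.com/items/books/1']
print(second_rule)           # ['http://example.com/items/electronics/2']
```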

If you are from China, I have a Chinese blog post about this:

别再滥用scrapy CrawlSpider中的follow=True


Let's check out how the rules work under the hood:

def _requests_to_follow(self, response):
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        for link in links:
            seen.add(link)
            r = Request(url=link.url, callback=self._response_downloaded)
            # record which rule extracted this link, so _response_downloaded
            # can look up its callback later
            r.meta.update(rule=n, link_text=link.text)
            yield r

As you can see, when we follow a link, the links in the response are extracted by every rule in a for loop, and each extracted link is added to a set so that later rules cannot extract it again.
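The seen-set logic above can be sketched in plain Python, with hard-coded link lists standing in for what extract_links would return for each rule, to show why a link claimed by an earlier rule is never yielded for a later one:

```python
# Minimal sketch of the seen-set logic; the link lists are hypothetical
# stand-ins for rule.link_extractor.extract_links(response).
rule_extractions = [
    ["/items/1", "/items/electronics/2"],   # rule 0: allow=r'/items'
    ["/items/electronics/2"],               # rule 1: allow=r'/items/electronics'
]

seen = set()
followed = []  # (rule_index, link) pairs that would become Requests
for n, links in enumerate(rule_extractions):
    for link in links:
        if link in seen:
            continue  # already claimed by an earlier rule
        seen.add(link)
        followed.append((n, link))

# Rule 1 gets nothing: rule 0 already claimed the electronics link.
print(followed)  # [(0, '/items/1'), (0, '/items/electronics/2')]
```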

Each of these responses is then handled by self._response_downloaded:

def _response_downloaded(self, response):
    rule = self._rules[response.meta['rule']]
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

def _parse_response(self, response, callback, cb_kwargs, follow=True):

    if callback:
        cb_res = callback(response, **cb_kwargs) or ()
        cb_res = self.process_results(response, cb_res)
        for requests_or_item in iterate_spider_output(cb_res):
            yield requests_or_item

    # follow will go back to the rules again
    if follow and self._follow_links:
        for request_or_item in self._requests_to_follow(response):
            yield request_or_item

So _parse_response calls self._requests_to_follow(response) again, and the cycle repeats until no new links are found.
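That feedback loop can be sketched as a simple work queue over a hypothetical link graph (a dict standing in for real responses): every followed page is fed back through the same extraction step, and the dupe filter keeps already-seen links out, so the crawl terminates once nothing new appears.

```python
from collections import deque

# Hypothetical link graph: page -> links found on that page.
link_graph = {
    "/items": ["/items/1", "/items/2"],
    "/items/1": ["/items/2"],   # duplicate link, filtered out below
    "/items/2": [],
}

seen = {"/items"}
queue = deque(["/items"])
visited = []
while queue:
    page = queue.popleft()
    visited.append(page)                  # "callback" runs here
    for link in link_graph.get(page, []): # "_requests_to_follow" runs here
        if link not in seen:              # the RFPDupeFilter analogue
            seen.add(link)
            queue.append(link)

print(visited)  # ['/items', '/items/1', '/items/2']
```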

In summary:

I would be tempted to use a BaseSpider instead of a CrawlSpider. With a BaseSpider you can define an intended flow of requests, instead of finding ALL hrefs on the page and visiting them based on global rules. Use yield Request(...) to continue looping through the parent sets of links and callbacks, passing the output item along all the way to the end.

From your description:

I think the crawler should work something like this: the rules crawler is a kind of loop. When the first link is matched, the crawler follows to the "Step 2" page, then to "Step 3", and after that it extracts the data. Having done that, it returns to "Step 1" to match the second link, and the loop starts again until there are no links left in the first step.

A request callback stack like this would suit you very well, since you know the order of the pages and which pages you need to scrape. It also has the added benefit of letting you collect information across multiple pages before returning the output item to be processed.

import re

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class Basketspider(BaseSpider, errorLog):
    # errorLog and WhateverYourOutputItemIs are your own classes
    name = "basketsp_test"
    download_delay = 0.5

    def start_requests(self):
        item = WhateverYourOutputItemIs()
        yield Request("http://www.euroleague.net/main/results/by-date",
                      callback=self.parseSeasonsLinks, meta={'item': item})

    def parseSeasonsLinks(self, response):
        item = response.meta['item']

        hxs = HtmlXPathSelector(response)
        html = hxs.extract()

        roundLinkPattern = re.compile(r'http://www\.euroleague\.net/main/results/by-date\?gamenumber=\d+&phasetypecode=RS')

        roundLinkList = []
        for roundLink in re.findall(roundLinkPattern, html):
            if roundLink not in roundLinkList:
                roundLinkList.append(roundLink)

        for roundLink in roundLinkList:

            # if you want this info in the final item
            item['RoundLink'] = roundLink

            # generate a new request for the round page
            yield Request(roundLink, callback=self.parseRoundPage, meta={'item': item})


    def parseRoundPage(self, response):

        item = response.meta['item']
        # Do whatever you need to do in here; call more requests if needed,
        # or return the item here

        item['Thing'] = 'infoOnPage'
        #....
        #....
        #....

        return item