Following links with the Scrapy web crawler framework

情书的邮戳 2020-12-13 03:21

After reading the Scrapy docs several times, I'm still not catching the difference between using CrawlSpider rules and implementing my own link extraction mechanism in the callback.

2 Answers
  •  愿得一人
     2020-12-13 03:52

    CrawlSpider inherits from BaseSpider; it just adds rules for extracting and following links. For contrast, here is a minimal sketch of that rules-based approach (the allow pattern and callback name are illustrative; note that a CrawlSpider must not override parse, because the rules machinery uses it internally):
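
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class MyCrawlSpider(CrawlSpider):
        """Declarative link extraction and following via rules."""
        name = 'mycrawler'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/']

        rules = (
            # Download every link whose URL matches the pattern and pass the
            # page to parse_item; follow=True keeps extracting links from
            # those pages as well.
            Rule(SgmlLinkExtractor(allow=r'/category/'),
                 callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            pass  # extract item fields here

    If these rules are not flexible enough for you, use BaseSpider: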

    import urlparse

    from scrapy import log
    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector
    from scrapy.spider import BaseSpider

    class USpider(BaseSpider):
        """My spider."""
        name = 'uspider'  # every Scrapy spider needs a unique name
    
        start_urls = ['http://www.amazon.com/s/?url=search-alias%3Dapparel&sort=relevance-fs-browse-rank']
        allowed_domains = ['amazon.com']
    
        def parse(self, response):
            '''Parse main category search page and extract subcategory search link.'''
            self.log('Downloaded category search page.', log.DEBUG)
            if response.meta['depth'] > 5:
                self.log('Categories depth limit reached (recursive links?). Stopping further following.', log.WARNING)
                return
    
            hxs = HtmlXPathSelector(response)
            subcategories = hxs.select("//div[@id='refinements']/*[starts-with(.,'Department')]/following-sibling::ul[1]/li/a[span[@class='refinementLink']]/@href").extract()
            for subcategory in subcategories:
                subcategorySearchLink = urlparse.urljoin(response.url, subcategory)
                yield Request(subcategorySearchLink, callback=self.parseSubcategory)
    
        def parseSubcategory(self, response):
            '''Parse subcategory search page and extract item links.'''
            hxs = HtmlXPathSelector(response)
    
            for itemLink in hxs.select('//a[@class="title"]/@href').extract():
                itemLink = urlparse.urljoin(response.url, itemLink)
                self.log('Requesting item page: ' + itemLink, log.DEBUG)
                yield Request(itemLink, callback=self.parseItem)
    
            try:
                nextPageLink = hxs.select("//a[@id='pagnNextLink']/@href").extract()[0]
            except IndexError:
                # No "next page" link: this subcategory has been fully crawled.
                self.log('Whole category parsed: ' + response.url, log.DEBUG)
            else:
                nextPageLink = urlparse.urljoin(response.url, nextPageLink)
                self.log('Going to next search page: ' + nextPageLink, log.DEBUG)
                yield Request(nextPageLink, callback=self.parseSubcategory)
    
        def parseItem(self, response):
            '''Parse item page and extract product info.'''
    
            hxs = HtmlXPathSelector(response)
            item = UItem()  # UItem: the project's Item subclass (not shown)
    
            # extractText is a helper method of this spider (definition not
            # shown here), which presumably returns the text at the given XPath.
            item['brand'] = self.extractText("//div[@class='buying']/span[1]/a[1]", hxs)
            item['title'] = self.extractText("//span[@id='btAsinTitle']", hxs)
            ...
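
    As an aside, the manual depth check in parse duplicates what Scrapy's built-in DEPTH_LIMIT setting already offers (DepthMiddleware enforces it and is also what fills response.meta['depth']). A minimal sketch, assuming a standard project settings.py:

    # settings.py -- built-in depth cap instead of the manual check (value illustrative)
    DEPTH_LIMIT = 5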
    

    And if even BaseSpider's start_urls is not flexible enough for you, override the start_requests method.
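
    A minimal sketch, assuming the seed URLs come from some dynamic source (the hard-coded list here is just a stand-in):

    class VSpider(BaseSpider):
        """Spider that builds its initial requests itself."""
        name = 'vspider'

        def start_requests(self):
            # Replaces start_urls entirely: yield the initial requests,
            # built from any source (file, database, command-line args).
            seeds = ['http://www.amazon.com/s/?url=search-alias%3Dapparel']
            for url in seeds:
                yield Request(url, callback=self.parse)

        def parse(self, response):
            pass  # handle the downloaded pages here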
