Scrapy: How to limit number of urls scraped in SitemapSpider

Submitted by 岁酱吖の on 2021-01-05 06:24:06

Question


I'm working on a sitemap spider. The spider takes one sitemap URL and scrapes all of the URLs in that sitemap. I want to limit the number of URLs scraped to 100.

I can't use CLOSESPIDER_PAGECOUNT because I use an XML export pipeline. It seems that when Scrapy reaches the page count, it stops everything, including the XML generation, so the XML file is never closed and ends up invalid.
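For context, the setup being described looks roughly like this (a sketch assuming the built-in XML feed export rather than a custom pipeline; the setting names are standard Scrapy settings, the values are illustrative):

# settings.py (illustrative values, not taken from the question)
FEED_FORMAT = 'xml'          # write scraped items as an XML feed
FEED_URI = 'items.xml'       # output file; reportedly left invalid when the crawl is cut off
CLOSESPIDER_PAGECOUNT = 100  # CloseSpider extension: stop after 100 crawled responses

The spider itself looks like this: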

from scrapy import Request
from scrapy.spiders import SitemapSpider


class MainSpider(SitemapSpider):
    name = 'main_spider'
    allowed_domains = ['doman.com']
    sitemap_urls = ['http://doman.com/sitemap.xml']

    def start_requests(self):
        # Feed the sitemap URLs to SitemapSpider's built-in sitemap parser.
        for url in self.sitemap_urls:
            yield Request(url, self._parse_sitemap)

    def parse(self, response):
        print('URL: {}'.format(response.url))
        if self._is_product(response):
            URL = response.url
            ITEM_ID = self._extract_code(response)

    ...

Do you know what to do?


Answer 1:


Using a plain return was not enough for me, but you can combine it with the CloseSpider exception:

# To import it:
from scrapy.exceptions import CloseSpider

# Later, to use it:
raise CloseSpider('message')

I posted the whole code combining both on Stack Overflow here.
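A minimal sketch of what combining the two could look like (the spider below and its limit attribute are illustrative, not the code linked above): count handled responses in parse() and raise CloseSpider once the limit is reached, which asks Scrapy to close the spider cleanly.

from scrapy.exceptions import CloseSpider
from scrapy.spiders import SitemapSpider


class LimitedSitemapSpider(SitemapSpider):
    # Illustrative names and values; the CloseSpider usage is the point here.
    name = 'limited_sitemap_spider'
    sitemap_urls = ['http://doman.com/sitemap.xml']
    limit = 100  # maximum number of pages to handle
    count = 0

    def parse(self, response):
        if self.count >= self.limit:
            # Request a clean shutdown instead of silently skipping further responses.
            raise CloseSpider('reached the limit of {} pages'.format(self.limit))
        self.count += 1
        # ... extract and yield items here ...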




Answer 2:


Why not have a count property on the spider, initialized to 0, and then in your parse method do:

def parse(self, response):
    if self.count >= 100:
        return
    self.count += 1
    # do actual parsing here
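For completeness, the count property the answer refers to has to be initialized somewhere; on the question's spider that can simply be a class attribute (a sketch):

class MainSpider(SitemapSpider):
    name = 'main_spider'
    allowed_domains = ['doman.com']
    sitemap_urls = ['http://doman.com/sitemap.xml']
    count = 0  # incremented in parse(); responses are ignored once it reaches 100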


Source: https://stackoverflow.com/questions/47126973/scrapy-how-to-limit-number-of-urls-scraped-in-sitemapspider
