How to limit number of followed pages per site in Python Scrapy
I am trying to build a spider that can efficiently scrape text from many websites. Since I am a Python user, I was referred to Scrapy. To avoid scraping huge websites, I want to limit the spider to scrape no more than 20 pages of a certain "depth" per website. Here is my spider:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


    class DownloadSpider(CrawlSpider):
        name = 'downloader'
        download_path = '/home/MyProjects/crawler'
        rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)

        def __init__(self, *args, **kwargs):
            super(DownloadSpider, self).__init__(*args, **kwargs)
            self.urls_file_path =
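For concreteness, here is a minimal sketch of the behavior I am after, assuming a per-domain counter combined with the Rule's process_request hook (which drops a request when it returns None). The names LimitedDownloadSpider, max_pages_per_site, pages_seen, and limit_requests are hypothetical, and the cap is approximate since requests already queued are not recounted:

    from collections import defaultdict
    from urlparse import urlparse  # Python 2, matching the snippet above

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


    class LimitedDownloadSpider(CrawlSpider):
        # Hypothetical variant of DownloadSpider that stops following a site
        # once max_pages_per_site of its pages have been scraped.
        name = 'limited_downloader'
        max_pages_per_site = 20
        rules = (
            Rule(SgmlLinkExtractor(), callback='parse_item', follow=True,
                 process_request='limit_requests'),
        )

        def __init__(self, *args, **kwargs):
            super(LimitedDownloadSpider, self).__init__(*args, **kwargs)
            self.pages_seen = defaultdict(int)  # domain -> pages parsed so far

        def limit_requests(self, request):
            # Rule's process_request hook: returning None drops the request,
            # so links on sites that reached the cap are never scheduled.
            domain = urlparse(request.url).netloc
            if self.pages_seen[domain] >= self.max_pages_per_site:
                return None
            return request

        def parse_item(self, response):
            self.pages_seen[urlparse(response.url).netloc] += 1
            # ... extract and save the page text here ...

Scrapy also ships a built-in DEPTH_LIMIT setting (set in settings.py) that caps how many link hops are followed from each start URL, which covers the "depth" part of the question, but as far as I can tell it does not count pages per site.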