Running Scrapy from a script - Hangs

没有蜡笔的小新 2020-12-13 05:08

I'm trying to run Scrapy from a script as discussed here. It suggested using this snippet, but when I do, it hangs indefinitely. The snippet was written back in version 0.10; is it still compatible with the current version?

1 Answer
  • 2020-12-13 05:53
    from scrapy import signals, log
    from scrapy.xlib.pydispatch import dispatcher
    from scrapy.crawler import CrawlerProcess
    from scrapy.conf import settings
    from scrapy.http import Request
    from scrapy.spider import BaseSpider
    
    def handleSpiderIdle(spider):
        '''Handle spider idle event.''' # http://doc.scrapy.org/topics/signals.html#spider-idle
        print '\nSpider idle: %s. Restarting it... ' % spider.name
        for url in spider.start_urls: # reschedule start urls
            spider.crawler.engine.crawl(Request(url, dont_filter=True), spider)
    
    mySettings = {'LOG_ENABLED': True, 'ITEM_PIPELINES': ['mybot.pipeline.validate.ValidateMyItem']} # global settings http://doc.scrapy.org/topics/settings.html
    
    settings.overrides.update(mySettings)
    
    crawlerProcess = CrawlerProcess(settings)
    crawlerProcess.install()
    crawlerProcess.configure()
    
    class MySpider(BaseSpider):
        name = 'myspider' # every spider needs a unique name
        start_urls = ['http://site_to_scrape']

        def parse(self, response):
            yield item # placeholder: build and yield your items here
    
    spider = MySpider() # create a spider ourselves
    crawlerProcess.queue.append_spider(spider) # add it to spiders pool
    
    dispatcher.connect(handleSpiderIdle, signals.spider_idle) # use this if you need to handle idle event (restart spider?)
    
    log.start() # depends on LOG_ENABLED
    print "Starting crawler."
    crawlerProcess.start()
    print "Crawler stopped."
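
    Note that this snippet uses the old 0.10-era API (scrapy.conf, scrapy.xlib.pydispatch, the crawler queue), which was removed in later releases. For comparison, here is a minimal sketch of the same idea on modern Scrapy (1.x and later), where CrawlerProcess manages the Twisted reactor itself; the spider class and URL are placeholders carried over from above:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        name = 'myspider'
        start_urls = ['http://site_to_scrape']

        def parse(self, response):
            yield {'url': response.url} # placeholder item

    process = CrawlerProcess(settings={'LOG_ENABLED': True})
    process.crawl(MySpider)
    process.start() # blocks until the crawl finishes, then stops the reactor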
    

    UPDATE:

    If you also need per-spider settings, see this example:

    for spiderConfig in spiderConfigs:
        spiderConfig = spiderConfig.copy() # a dictionary similar to the one with global settings above
        spiderName = spiderConfig.pop('name') # the spider's name is in the config - the same spider class can be reused in several instances under different names
        spiderModuleName = spiderConfig.pop('spiderClass') # the module containing the spider is in the config
        spiderModule = __import__(spiderModuleName, {}, {}, ['']) # import that module
        SpiderClass = spiderModule.Spider # the spider class is named 'Spider'
        spider = SpiderClass(name=spiderName, **spiderConfig) # create the spider with its particular settings
        crawlerProcess.queue.append_spider(spider) # add the spider to the spider pool
    

    Example of a per-spider settings file:

    name = plunderhere_com
    allowed_domains = plunderhere.com
    spiderClass = scraper.spiders.plunderhere_com
    start_urls = http://www.plunderhere.com/categories.php?
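
    The answer doesn't show how such a file becomes one of the spiderConfigs dicts consumed by the loop above. A minimal sketch, assuming a flat key = value layout and that allowed_domains and start_urls should be wrapped into one-element lists (both the parsing rules and the file name below are assumptions, not part of the original answer):

    def loadSpiderConfig(path):
        '''Parse a flat "key = value" file into a spider config dict.'''
        config = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or '=' not in line:
                    continue # skip blank or malformed lines
                key, value = [part.strip() for part in line.split('=', 1)]
                if key in ('allowed_domains', 'start_urls'):
                    config[key] = [value] # Scrapy expects lists for these
                else:
                    config[key] = value
        return config

    spiderConfigs = [loadSpiderConfig('plunderhere_com.cfg')] # hypothetical file name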
    