Scrapy crawl from script always blocks script execution after scraping

一整个雨季 · 2020-11-30 06:47

I am following this guide http://doc.scrapy.org/en/0.16/topics/practices.html#run-scrapy-from-a-script to run Scrapy from my script. Here is the relevant part, which is essentially the example from that page, plus a final print that never gets reached:
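
    from twisted.internet import reactor

    from scrapy import log
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings

    from testspiders.spiders.followall import FollowAllSpider

    spider = FollowAllSpider(domain='scrapinghub.com')
    crawler = Crawler(Settings())
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # blocks here; nothing ever stops the reactor
    print('This line is never reached!')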

2 Answers
  •  我在风中等你 · 2020-11-30 07:14

    You will need to stop the reactor when the spider finishes. You can accomplish this by listening for the spider_closed signal:

    from twisted.internet import reactor
    
    from scrapy import log, signals
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy.xlib.pydispatch import dispatcher
    
    from testspiders.spiders.followall import FollowAllSpider
    
    def stop_reactor():
        reactor.stop()

    # stop the Twisted reactor as soon as the spider signals it has closed
    dispatcher.connect(stop_reactor, signal=signals.spider_closed)
    spider = FollowAllSpider(domain='scrapinghub.com')
    crawler = Crawler(Settings())
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    log.msg('Running reactor...')
    reactor.run()  # the script will block here until the spider is closed
    log.msg('Reactor stopped.')
    

    And the command line log output might look something like:

    stav@maia:/srv/scrapy/testspiders$ ./api
    2013-02-10 14:49:38-0600 [scrapy] INFO: Running reactor...
    2013-02-10 14:49:47-0600 [followall] INFO: Closing spider (finished)
    2013-02-10 14:49:47-0600 [followall] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 23934,...}
    2013-02-10 14:49:47-0600 [followall] INFO: Spider closed (finished)
    2013-02-10 14:49:47-0600 [scrapy] INFO: Reactor stopped.
    stav@maia:/srv/scrapy/testspiders$
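
    Note: scrapy.xlib.pydispatch and the scrapy.log module were removed in later Scrapy releases. On current versions the usual way to get the same "crawl, then carry on" behaviour is CrawlerProcess, which starts the reactor for you and stops it once every crawl has finished. A minimal sketch, assuming the same testspiders project:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from testspiders.spiders.followall import FollowAllSpider

    process = CrawlerProcess(get_project_settings())
    process.crawl(FollowAllSpider, domain='scrapinghub.com')
    process.start()  # blocks until the crawl finishes, then stops the reactor
    print('Reactor stopped, the script keeps running.')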
    
