Run a Scrapy spider in a Celery Task

刺人心 2020-11-28 04:13

This is not working anymore; Scrapy's API has changed.

The documentation now features a way to "Run Scrapy from a script", but I get the ReactorNotRestartable error.
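For context, here is roughly what that documented approach looks like; the spider name and URL are placeholders. Calling it a second time in the same long-lived worker process is what triggers the error:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        name = 'my_spider'
        start_urls = ['https://example.com']

    def run_spider():
        process = CrawlerProcess()
        process.crawl(MySpider)
        process.start()  # starts the Twisted reactor and blocks until the crawl finishes

    run_spider()
    run_spider()  # second call raises ReactorNotRestartable: the reactor cannot be started again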

4 Answers
  •  臣服心动
    2020-11-28 05:02

    To avoid the ReactorNotRestartable error when running Scrapy in a Celery task queue, I've used threads. It's the same approach used to run the Twisted reactor several times in one app. Scrapy is built on Twisted, so we can do the same thing here.

    Here is the code:

    from threading import Thread
    from scrapy.crawler import CrawlerProcess
    import scrapy
    
    class MySpider(scrapy.Spider):
        name = 'my_spider'
    
    
    class MyCrawler:
    
        spider_settings = {}
    
        def run_crawler(self):
            # Create a fresh CrawlerProcess per run and start the Twisted
            # reactor in a separate thread, so the crawl doesn't block the
            # Celery worker and the reactor can be started for each task.
            process = CrawlerProcess(self.spider_settings)
            process.crawl(MySpider)
            Thread(target=process.start).start()
    
    

    Don't forget to increase CELERYD_CONCURRENCY for Celery:

    CELERYD_CONCURRENCY = 10
    

    It works fine for me.
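
    For completeness, here is a minimal sketch of calling the crawler class above from a Celery task; the app name, broker URL, and import path are assumptions, not part of the original answer:

    from celery import Celery
    from myproject.crawling import MyCrawler  # hypothetical module containing the class above

    app = Celery('tasks', broker='redis://localhost:6379/0')  # placeholder broker URL

    @app.task
    def crawl_task():
        crawler = MyCrawler()
        crawler.spider_settings = {'LOG_LEVEL': 'INFO'}  # any per-run Scrapy settings
        crawler.run_crawler()  # returns immediately; the crawl runs in its own thread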

    This does not block the running process, but Scrapy best practice is to process data in callbacks anyway. Just do it this way:

    # Inject a result callback into each spider before starting the crawl,
    # so scraped data is handed back to the caller from the spider's callbacks.
    for crawler in process.crawlers:
        crawler.spider.save_result_callback = some_callback
        crawler.spider.save_result_callback_params = some_callback_params
    
    Thread(target=process.start).start()
    
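    How the spider consumes these attributes is up to you; here is a hypothetical sketch of a parse method that hands each scraped item to the injected callback (the item fields and the way the params are passed are illustrative only):

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'my_spider'
        start_urls = ['https://example.com']

        def parse(self, response):
            item = {'url': response.url, 'title': response.css('title::text').get()}
            # Hand the scraped item to the externally injected callback
            # instead of (or in addition to) yielding it to Scrapy's pipelines.
            self.save_result_callback(item, self.save_result_callback_params)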
