This is not working anymore, Scrapy's API has changed.
Now the documentation features a way to "Run Scrapy from a script", but I get the ReactorNotRestartable error.
To avoid the ReactorNotRestartable error when running Scrapy in a Celery task queue, I used threads. It is the same approach used to run the Twisted reactor several times in one app. Scrapy is also built on Twisted, so we can do it the same way.
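For context, here is a minimal sketch of that bare Twisted pattern (the run_reactor helper is just for illustration, not part of any API): the reactor runs in a background thread, so the caller is never blocked by it.

from threading import Thread

from twisted.internet import reactor

def run_reactor():
    # Signal handlers can only be installed from the main thread,
    # so disable them when the reactor runs in a worker thread.
    reactor.run(installSignalHandlers=False)

Thread(target=run_reactor).start()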
Here is the code:
from threading import Thread

import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    name = 'my_spider'


class MyCrawler:

    spider_settings = {}

    def run_crawler(self):
        process = CrawlerProcess(self.spider_settings)
        process.crawl(MySpider)
        # Start the crawl in its own thread so the calling task
        # returns immediately and the reactor never has to restart.
        Thread(target=process.start).start()
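A hypothetical Celery task wiring this up might look like the following; the app name, broker URL, and task name are placeholders, not part of the code above:

from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')  # broker URL is a placeholder

@app.task
def run_my_crawler():
    # Each task invocation starts its own crawl thread.
    MyCrawler().run_crawler()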
Don't forget to increase CELERYD_CONCURRENCY for Celery:

CELERYD_CONCURRENCY = 10

This works fine for me.
This does not block the running process, but Scrapy best practice is to process data in callbacks anyway. Just do it this way:
# Attach the result callback to each spider before starting the crawl.
for crawler in process.crawlers:
    crawler.spider.save_result_callback = some_callback
    crawler.spider.save_result_callback_params = some_callback_params

Thread(target=process.start).start()
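For completeness, here is a hedged sketch of how the spider might invoke the injected callback from its own parse method; the start URL, item fields, and callback signature are all illustrative assumptions:

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['https://example.com']  # placeholder URL

    def parse(self, response):
        item = {'url': response.url, 'title': response.css('title::text').get()}
        # Hand each scraped item to the callback that was attached
        # before process.start() was called.
        self.save_result_callback(item, self.save_result_callback_params)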