The official docs give several ways to run Scrapy crawlers from code:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    ...  # your spider definition

process = CrawlerProcess()
process.crawl(MySpider)
process.start()  # blocks until the crawl finishes
Netimen's answer is correct: process.start() calls reactor.run(), which blocks the thread. However, I don't think it is necessary to subclass billiard.Process. Although poorly documented, billiard.Process has a set of APIs for calling another function asynchronously without subclassing:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from billiard import Process

crawler = CrawlerProcess(get_project_settings())
# stop_after_crawl is an argument to crawler.start, so it must be passed
# through kwargs rather than directly to Process
process = Process(target=crawler.start, kwargs={'stop_after_crawl': False})

def crawl(*args, **kwargs):
    crawler.crawl(*args, **kwargs)  # schedule the spider on the crawler
    process.start()                 # run the reactor in a separate process
Note that if you don't pass stop_after_crawl=False, you may run into a ReactorNotRestartable exception when you run the crawler more than once.
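For completeness, here is a minimal sketch of how the crawl helper above might be used. MySpider and myproject.spiders are hypothetical stand-ins for your own spider, and this assumes a fork-based process start method so the spider scheduled in the parent is inherited by the child:

from myproject.spiders import MySpider  # hypothetical: your own spider class

crawl(MySpider)      # schedule MySpider and start the reactor in a child process
# ... do other work while the crawl runs ...
process.terminate()  # with stop_after_crawl=False the reactor never stops on its own
process.join()

Also keep in mind that a billiard Process, like a multiprocessing one, can only be started once, so you would create a fresh Process for each subsequent crawl.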