可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
This is not working anymore, scrapy's API has changed.
Now the documentation feature a way to "Run Scrapy from a script" but I get the ReactorNotRestartable
error.
My task:
from celery import Task from twisted.internet import reactor from scrapy.crawler import Crawler from scrapy import log, signals from scrapy.utils.project import get_project_settings from .spiders import MySpider class MyTask(Task): def run(self, *args, **kwargs): spider = MySpider settings = get_project_settings() crawler = Crawler(settings) crawler.signals.connect(reactor.stop, signal=signals.spider_closed) crawler.configure() crawler.crawl(spider) crawler.start() log.start() reactor.run()
回答1:
The twisted reactor cannot be restarted. A work around for this is to let the celery task fork a new child process for each crawl you want to execute as proposed in the following post:
Running Scrapy spiders in a Celery task
This gets around the "reactor cannot be restart-able issue" by utilizing the multiprocessing package. But the problem with this is that the workaround is now obsolete with the latest celery version due to the fact that you will instead run into another issue where a daemon process can't spawn sub processes. So in order for the workaround to work you need to go down in celery version.
Yes, and the scrapy API has changed. But with minor modifications(import Crawler instead of CrawlerProcess). You can get the workaround to work by going down in celery version.
The Celery issue can be found here: Celery Issue #1709
Here is my updated crawl-script that works with newer celery versions by utilizing billiard instead of multiprocessing:
from scrapy.crawler import Crawler from scrapy.conf import settings from myspider import MySpider from scrapy import log, project from twisted.internet import reactor from billiard import Process from scrapy.utils.project import get_project_settings class UrlCrawlerScript(Process): def __init__(self, spider): Process.__init__(self) settings = get_project_settings() self.crawler = Crawler(settings) self.crawler.configure() self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed) self.spider = spider def run(self): self.crawler.crawl(self.spider) self.crawler.start() reactor.run() def run_spider(url): spider = MySpider(url) crawler = UrlCrawlerScript(spider) crawler.start() crawler.join()
Edit: By reading the celery issue #1709 they suggest to use billiard instead of multiprocessing in order for the subprocess limitation to be lifted. In other words we should try billiard and see if it works!
Edit 2: Yes, by using billiard, my script works with the latest celery build! See my updated script.
回答2:
The Twisted reactor cannot be restarted, so once one spider finishes running and crawler
stops the reactor implicitly, that worker is useless.
As posted in the answers to that other question, all you need to do is kill the worker which ran your spider and replace it with a fresh one, which prevents the reactor from being started and stopped more than once. To do this, just set:
CELERYD_MAX_TASKS_PER_CHILD = 1
The downside is that you're not really using the Twisted reactor to its full potential and wasting resources running multiple reactors, as one reactor can run multiple spiders at once in a single process. A better approach is to run one reactor per worker (or even one reactor globally) and don't let crawler
touch it.
I'm working on this for a very similar project, so I'll update this post if I make any progress.
回答3:
To avoid ReactorNotRestartable error when running Scrapy in Celery Tasks Queue I've used threads. The same approach used to run Twisted reactor several times in one app. Scrapy also used Twisted, so we can do the same way.
Here is the code:
from threading import Thread from scrapy.crawler import CrawlerProcess import scrapy class MySpider(scrapy.Spider): name = 'my_spider' class MyCrawler: spider_settings = {} def run_crawler(self): process = CrawlerProcess(self.spider_settings) process.crawl(MySpider) Thread(target=process.start).start()
Don't forget to increase CELERYD_CONCURRENCY for celery.
CELERYD_CONCURRENCY = 10
works fine for me.
This is not blocking process running, but anyway scrapy best practice is to process data in callbacks. Just do this way:
for crawler in process.crawlers: crawler.spider.save_result_callback = some_callback crawler.spider.save_result_callback_params = some_callback_params Thread(target=process.start).start()
回答4:
I would say this approach is very inefficient if you have a lot of tasks to process. Because Celery is threaded - runs every task within its own thread. Let's say with RabbitMQ as a broker you can pass >10K q/s. With Celery this would potentially cause to 10K threads overhead! I would advise not to use celery here. Instead access the broker directly!