Scrapy spider: add health check before starting the spider

Submitted by 六眼飞鱼酱① on 2020-05-31 05:42:10

Question


I would like to not start the spider job if the external APIs it depends on (Cassandra, MySQL, etc.) are not reachable.

import json
import logging

from cassandra.cluster import Cluster


class HealthCheck:
    @staticmethod
    def is_healthy():
        # configHelper is provided elsewhere in the project
        config = json.loads(configHelper.get_data())
        cassandra_config = config['cassandra']
        cluster = Cluster(cassandra_config['hosts'],
                          port=cassandra_config['port'])
        try:
            session = cluster.connect(cassandra_config['keyspace'])
            # CQL has no bare 'SELECT 1'; probe a system table instead
            session.execute('SELECT release_version FROM system.local')
        except Exception as e:
            logging.error(e)
            return False  # report the failure to the caller
        finally:
            cluster.shutdown()
        return True

I can invoke is_healthy inside the __init__ method of the spider, but then I would have to repeat this in every spider. Does anyone have a better suggestion for where to invoke is_healthy?
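
For illustration, here is a minimal sketch of that per-spider approach using a shared base class, so the check is written only once (BaseHealthCheckSpider and ProductsSpider are hypothetical names, not part of the original question):

import scrapy


class BaseHealthCheckSpider(scrapy.Spider):
    """Hypothetical base class: runs the health check once per spider."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # HealthCheck is the class defined above; raising in __init__
        # aborts spider creation before any request is scheduled
        if not HealthCheck.is_healthy():
            raise RuntimeError('External dependencies are unreachable')


class ProductsSpider(BaseHealthCheckSpider):
    name = 'products'
    start_urls = ['https://example.com']

    def parse(self, response):
        pass

The answer below avoids even this per-spider boilerplate by moving the check into a Scrapy extension.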


Answer 1:


This is not an easy task; see for example this issue. The problem is that you can't close the spider right after it is opened, because that might happen before the engine has started (see here). However, there seems to be a solution, although a bit hacky. Here's a working prototype as a Scrapy extension:

import logging

from scrapy import signals
from twisted.internet import task


logger = logging.getLogger(__name__)


class HealthcheckExtension(object):
    """Close spiders if healthcheck fails"""

    def __init__(self, crawler):
        self.crawler = crawler
        crawler.signals.connect(self.engine_started, signal=signals.engine_started)
        crawler.signals.connect(self.engine_stopped, signal=signals.engine_stopped)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def engine_started(self):
        healthy = self.perform_healthcheck()
        if not healthy:
            logger.info('Healthcheck failed, closing all spiders')
            # Spiders can't be closed before the engine is fully started,
            # so poll with a tiny interval until closing succeeds
            self.task = task.LoopingCall(self.close_spiders)
            self.task.start(0.0001, now=True)

    def engine_stopped(self):
        # Stop the polling task (if any) so the reactor can shut down cleanly
        looping_task = getattr(self, 'task', None)
        if looping_task and looping_task.running:
            looping_task.stop()

    def perform_healthcheck(self):
        # perform the health check here and return True if passes
        return False  # simulate failed healthcheck...

    def close_spiders(self):
        # The engine may not be running yet; keep trying until it is
        if self.crawler.engine.running:
            for spider in self.crawler.engine.open_spiders:
                self.crawler.engine.close_spider(spider, 'healthcheck_failed')

It performs the health check in the engine_started signal handler. If the check fails, it creates a periodic task (with as short a loop interval as possible) that closes the spiders as soon as the engine is running.
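
In a real setup, the perform_healthcheck stub could delegate to the HealthCheck class from the question. A minimal sketch, assuming that class is importable (the module path below is an assumption):

    def perform_healthcheck(self):
        # Hypothetical wiring: reuse the HealthCheck class from the question;
        # the module path is an assumption, not part of the original answer
        from demo.healthcheck import HealthCheck
        return HealthCheck.is_healthy()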

Enable the extension in settings.py:

EXTENSIONS = {
    'demo.extensions.HealthcheckExtension': 100
}

and run an arbitrary spider. It closes immediately with the appropriate finish_reason:

2020-02-29 17:17:43 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: demo)
2020-02-29 17:17:43 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.6.9 (default, Nov  7 2019, 10:44:02) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-5.3.0-40-generic-x86_64-with-Ubuntu-18.04-bionic
2020-02-29 17:17:43 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'demo', 'NEWSPIDER_MODULE': 'demo.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['demo.spiders']}
2020-02-29 17:17:43 [scrapy.extensions.telnet] INFO: Telnet Password: 8253cb10ff171340
2020-02-29 17:17:43 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'demo.extensions.HealthcheckExtension']
2020-02-29 17:17:43 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-02-29 17:17:43 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-02-29 17:17:43 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-02-29 17:17:43 [scrapy.core.engine] INFO: Spider opened
2020-02-29 17:17:43 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-29 17:17:43 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-02-29 17:17:43 [demo.extensions] INFO: Healthcheck failed, closing all spiders
2020-02-29 17:17:43 [scrapy.core.engine] INFO: Closing spider (healthcheck_failed)
2020-02-29 17:17:43 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.005618,
 'finish_reason': 'healthcheck_failed',
 'finish_time': datetime.datetime(2020, 2, 29, 16, 17, 43, 766734),
 'log_count/INFO': 11,
 'memusage/max': 52596736,
 'memusage/startup': 52596736,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 2, 29, 16, 17, 43, 761116)}
2020-02-29 17:17:43 [scrapy.core.engine] INFO: Spider closed (healthcheck_failed)


Source: https://stackoverflow.com/questions/60461028/scrapy-spider-add-health-check-before-starting-spider
