How to process all kinds of exceptions in a Scrapy project, in errback and callback?

北恋 2020-12-14 12:02

I am currently working on a scraper project where it is very important that EVERY request is properly handled, i.e. that either an error is logged or a successful result is saved.

At first, I thought it would be more "logical" to raise exceptions in the parsing callback and process them all in errback, which would make the code more readable. But I found that errback can only trap errors from the downloader module, such as non-200 response statuses. If I raise a self-implemented ParseError in the callback, the spider just raises it and stops.

2 Answers
  • 2020-12-14 12:33

    Yes, you are right - callback and errback are meant to be used only with the downloader, since Twisted is used for downloading resources, and Twisted uses deferreds - that's why callbacks are needed.

    Usually the only asynchronous part of Scrapy is the downloader; all the other parts work synchronously.
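
    For reference, here is a minimal sketch of what an errback does catch: downloader-level failures such as non-200 responses, DNS errors and timeouts. The module paths below are those of recent Scrapy versions, and the spider name and URL are placeholders, not anything from the original answers.

    import scrapy
    from scrapy.spidermiddlewares.httperror import HttpError
    from twisted.internet.error import DNSLookupError, TimeoutError, TCPTimedOutError

    class DemoSpider(scrapy.Spider):
        name = "demo"

        def start_requests(self):
            yield scrapy.Request("http://example.com/",
                                 callback=self.parse,
                                 errback=self.on_error)

        def parse(self, response):
            # Exceptions raised here do NOT reach on_error; they trigger
            # the spider_error signal instead (see the second answer below).
            yield {"url": response.url}

        def on_error(self, failure):
            # failure is a twisted.python.failure.Failure
            if failure.check(HttpError):
                self.logger.error("Non-200 response on %s", failure.value.response.url)
            elif failure.check(DNSLookupError):
                self.logger.error("DNS lookup failed for %s", failure.request.url)
            elif failure.check(TimeoutError, TCPTimedOutError):
                self.logger.error("Timeout on %s", failure.request.url)
            else:
                self.logger.error(repr(failure))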

    So, if you want to catch all non-downloader errors, you have to do it yourself:

    • put a big try/except block in your callback, or
    • write a decorator for your callbacks that does this (I like this idea more) - see the sketch after this list
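
    As a rough illustration of the decorator idea (the names handle_spider_errors and MySpider are made up for this sketch, not part of Scrapy or the original answer):

    import functools
    import traceback

    import scrapy

    def handle_spider_errors(callback):
        """Wrap a spider callback so that any exception is logged
        instead of propagating and stopping the spider."""
        @functools.wraps(callback)
        def wrapper(self, response, **kwargs):
            try:
                # The callback may be a generator, so iterate inside the
                # try block to also catch errors raised while yielding.
                for item in callback(self, response, **kwargs) or []:
                    yield item
            except Exception:
                self.logger.error("Callback failed for %s:\n%s",
                                  response.url, traceback.format_exc())
        return wrapper

    class MySpider(scrapy.Spider):
        name = "my_spider"
        start_urls = ["http://example.com/"]

        @handle_spider_errors
        def parse(self, response):
            raise ValueError("parse error")  # gets logged, spider keeps running
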
  • 2020-12-14 12:43

    EDIT 16 Nov 2012: Scrapy >= 0.16 uses a different method to attach handlers to signals; an extra example has been added.

    The simplest solution would be to write an extension in which you capture failures, using Scrapy signals. For example, the following extension will catch all errors and print a traceback.

    You could do anything with the failure, which is an instance of twisted.python.failure.Failure - for example, save it to your database or send an email.

    For Scrapy versions up to 0.16:

    from scrapy import signals
    from scrapy.xlib.pydispatch import dispatcher
    
    class FailLogger(object):
      def __init__(self):
        """ 
        Attach appropriate handlers to the signals
        """
        dispatcher.connect(self.spider_error, signal=signals.spider_error)
    
      def spider_error(self, failure, response, spider):
        # Called whenever a spider callback raises an exception
        print("Error on {0}, traceback: {1}".format(response.url, failure.getTraceback()))
    

    For Scrapy versions from 0.16 and up:

    from scrapy import signals
    
    class FailLogger(object):
    
      @classmethod
      def from_crawler(cls, crawler):
        ext = cls()
    
        crawler.signals.connect(ext.spider_error, signal=signals.spider_error)
    
        return ext
    
      def spider_error(self, failure, response, spider):
        # Called whenever a spider callback raises an exception
        print("Error on {0}, traceback: {1}".format(response.url, failure.getTraceback()))
    

    You would enable the extension in the settings, with something like:

    EXTENSIONS = {
        'spiders.extensions.faillog.FailLogger': 599,
    }
    